Check out my new page, “Digital Resources Management,” for a project report that I wrote about a recent cataloging and e-journal records maintenance project. The link opens a PDF document.
My day job right now involves creating a LibGuide called “Online Reference Shelf.” It provides basic reference resources to library patrons, including dictionaries, directories, encyclopedias, search engines, news, biographies, thesauri, travel, and weather information.
One of my tasks is to research the resources to determine whether they are the best, unique, or most relevant options for library patrons.
Today I worked on thesauri. To find out what kind of resource Bartleby.com provides in its Roget’s Thesaurus, I staged a trial search. I asked my coworker to tell me a word, any word. When he started thinking about it, I asked him to pick the first one that came to mind. He chose “infatuation.”
The following are Roget’s results.
Together, I think they do a decent job of defining infatuation. What do you think?
Finally! The recap of the fourth Introduction to Digital Preservation webinar, hosted by ASERL.
[To listen to the recordings and view the PowerPoint presentations, see ASERL's archive]
The title of this webinar: “Using FITS to Identify File Formats and Extract Metadata.” It was presented by Andrea Goethals, of Harvard University.
What is FITS?
- “File Information Tool Set”
- Specifications are the documents that file formats conform to, and they often have different versions.
- Authoritative specification information does not always exist for a file format. When it does exist, it can be unclear, complex, or long; it can reference other file formats; and it can depend on other specifications.
Further complications for tool builders and users
- OpenDoc formats are packaged as ZIP files; identifying such a file merely as a ZIP is not sufficient for preservation.
- Many formats (e.g., XML) are text formats.
- Some formats lack obvious identifying features.
- File formats can be difficult to accurately identify.
- Some are more specific than others (inconsistent).
How does FITS help?
- Combines the functionality of different file format identification tools.
Why build FITS?
- The motivation at Harvard was to offset the risk of accepting any format (including web archives, email attachments, donated external hard drives).
- Additionally, to integrate into existing preservation workflows.
- Strategy: to develop a tool manager instead of a tool, and to account for tool inaccuracy: to check tools against each other, and to verify results.
What is required?
- XML output, for tools that lack a graphical interface
What does FITS do?
- Identifies many file formats
- Validates a few file formats
- Extracts metadata
- Calculates basic file information
- Outputs technical metadata
- Identifies problem files (e.g., conflicting opinions on format, metadata values; unidentifiable formats)
- FITS translates each tool’s output to a common XML format, consolidates the results into a single FITS XML file, and can then translate that file into standard metadata formats.
- You can store the FITS XML files wherever you store metadata in the repository.
- The file is not modified during the process.
- The key to using multiple tools:
- Assists with tools that provide different names for the same format
- Assists with tools that provide different values for the same metadata
- Assists with tools that report in different ways that they can’t identify a file’s format
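As a rough illustration of how FITS’s consolidated XML might be read, here is a sketch using Python’s standard library. The XML snippet is a simplified, hypothetical stand-in for real FITS output, and the element and attribute names are assumptions based on the description above, not a verbatim FITS record.

```python
# Sketch: reading a FITS-style consolidated XML report.
# The XML below is an invented, simplified example, not real FITS output.
import xml.etree.ElementTree as ET

fits_xml = """
<fits>
  <identification status="CONFLICT">
    <identity format="Portable Document Format" mimetype="application/pdf">
      <tool toolname="Jhove" />
    </identity>
    <identity format="PDF/A" mimetype="application/pdf">
      <tool toolname="Droid" />
    </identity>
  </identification>
</fits>
"""

root = ET.fromstring(fits_xml)
ident = root.find("identification")
status = ident.get("status")  # FITS flags disagreements between tools
formats = [(i.get("format"), i.find("tool").get("toolname"))
           for i in ident.findall("identity")]

print(status)  # CONFLICT
for fmt, tool in formats:
    print(f"{tool} says: {fmt}")
```

Because every tool’s opinion ends up in one file, a conflict like the one above is visible at a glance instead of being buried in two different tools’ logs.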
[Then we watched nifty demonstrations in Windows as Andrea Goethals took us through what FITS does and how it does it. I discovered I can read basic XML.]
At Harvard University Libraries
- They store metadata in XML form in a metaschema
- Output is parsed and packaged
- Some of FITS’s data fits well into PREMIS
- Standard metadata block is added
- Other information is included with administrative metadata
Questions & Answers
Q: Are there plans to integrate FITS into large systems/repositories?
A: Archivematica uses it. DuraCloud looked into it, but it is mostly used in individual repositories.
Q: Do you need to have the individual File Format Identification tools loaded locally?
A: All necessary tools are downloaded with FITS.
Q: When FITS notes conflicts between tools’ results, how do you know which one is right?
A: Conflicts often occur with relatively unused formats. An XML file is included that can be used to educate oneself and to determine whether one identification is really just a more specific version of a broader format. (It provides a format tree.)
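The format-tree idea can be sketched in a few lines: a conflict between two tools may dissolve if one reported format is an ancestor of the other. The tree fragment below is invented for illustration; it is not FITS’s actual format-tree file.

```python
# Sketch of resolving "conflicts" with a format tree.
# PARENT maps each format to its broader parent; this fragment is made up.
PARENT = {
    "PDF/A-1a": "PDF/A",
    "PDF/A": "PDF",
    "PDF": None,
    "XHTML": "XML",
    "XML": "text",
}

def is_refinement(specific, broad):
    """True if `specific` sits at or under `broad` in the format tree."""
    while specific is not None:
        if specific == broad:
            return True
        specific = PARENT.get(specific)
    return False

print(is_refinement("PDF/A-1a", "PDF"))  # True: not a real conflict
print(is_refinement("XHTML", "PDF"))     # False: genuine disagreement
```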
Where to find FITS
- Download it from the FITS project website
- The mailing list is good for new versions, other news.
This whole series has been incredibly informative. Having listened to these experts talk about common/important tools that they use for digital preservation, I now have a better idea of not only the processes involved in digital preservation, but also how the different pieces fit together. The project planning information was pretty straightforward, and generally, not very different from many projects I’ve worked on in the past or learned about in library school.
Now that I know that some FITS information fits well into PREMIS, and that other information from FITS fits into administrative metadata sections, and that XML can carry them all, I have a better idea of how to use the metadata categories described in the second webinar.
I know which kinds of tools are meant to be used for various tasks in digital preservation projects, and I know what I need to learn (and what I don’t) in order to use them. I can point to FITS and PREMIS and say that they may be used in the implementation stage.
Lastly, I know so much more about where to go to find out more about the tools, processes, best practices, and current projects.
I not-so-recently “went to” the third Introduction to Digital Preservation webinar hosted by ASERL (Association of Southeastern Research Libraries).
[To listen to the recordings and view the PowerPoint presentations, see ASERL's archive]
This webinar was titled “Management of Incoming Born-Digital Special Collections,” presented by Gretchen Gueguen, of the University of Virginia.
Without further ado, my notes:
What is born-digital?
There are two layers:
- The files/content themselves
- Supporting software and operating system(s) (OS)
The same software/OS can be used for multiple files.
The Crucial Dependency
- Hardware. Including (but not limited to) ports, wires, ribbons, drives, connectors
- Translation between older and newer hardware can be achieved with forensic bridges/write-blockers
Imagine a doughnut.
“Preserve” is positioned in the doughnut hole, smack dab in the middle. Around the edge of the doughnut (starting on the left and moving clockwise, if you’re curious) reside:
- “Provide Access”
This includes old collections and new collections, as well as legacy material that has already been collected.
Do you further process these legacy collections, or deaccession them?
Appraisal Phase 1: Inventory
(the following list contains information/data you may want to collect in the inventory phase)
- Disk #
- ID #
- Collection name/title
- Record # (MARC or EAD)
- Media type
- Date (from label info)
- Label info
The above information can be used in cataloging, and can help future identification/location of items in the collection.
It may be necessary to:
- Research accession records
- Search the stacks
- Conduct a physical survey of a statistically significant sample of disks
Appraisal Phase 2: Evaluate
- Available resources (work, costs)
- File types and formats present in materials
- Volume of data vs. capacity to take it
- Condition of content: changed? corrupted?
- Dependencies on software, hardware
- Institution’s commitment to the content
- Migration or transformation required?
- Can you appraise/view the intellectual content?
- Policy framework (update it frequently and proactively)
- What capacity do you have for acquiring new born-digital collections?
- How will you deal with certain scenarios?
- Do you need special hardware to read the content?
- Do you have that hardware?
- Does someone else? (e.g., eBay): Given the scarcity of obsolete hardware, there is a growing interest in sharing equipment
- Is the disk/drive natively Read Only?
- Other media (e.g., floppies) are difficult
- Write blockers/forensic bridges: Hardware devices or software that block any writing onto disks (e.g., Tableau, Wiebe Tech; SAFE Block XP, MacForensicLab)
How to transfer the data to a new medium?
1. Disk imaging – a single-file, bit-level copy
- Captures unused space (sometimes called “file slack”), which is made up of binary zeros and can take up a lot of space
- Benefits: compact, single file, intact, complete
- Drawbacks: can capture unwanted data, and requires specialized technology. (The image can be transferred across a write-blocker if the disk is still readable.)
2. Logical imaging – select what you want and create an image
Transfer using (examples)
- Windows: NTFS (New Technology File System)
- Mac: HFS (Hierarchical File System)
- Transfer methods: over a network, using Duke Data Accessioner, BagIt, or an FTP tool such as FileZilla or Cyberduck (how about these names?).
- Web harvesting (e.g., Internet Archive)
- Save to modern media (CD, external hard drive)
- Image the hard drive in person
Is the file corrupted, lost, or changed?
- Checksums. If these haven’t changed, the file hasn’t changed.
- Check for viruses (stabilizing material): Do this in an un-networked space BEFORE uploading the files to a network!
- Search for Personally Identifiable Information (PII)
- Search for duplicate files using checksums.
- Use media inventory
- File inventory of contents (e.g., date, size, file name, type)
- Extract technical, forensic, and preservation metadata (e.g., PREMIS, PBCore)
- Use a spreadsheet if you don’t have fancy infrastructure to record this information
- Make multiple copies! (Lots of Copies Keep Stuff Safe, heh heh)
- Use repositories or a managed service system for metadata and storage
- If you don’t have one, how will you store and track content? (Spreadsheet and storage database)
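The checksum-and-spreadsheet approach above can be sketched with nothing but Python’s standard library: hash each file, record name/size/type in a CSV inventory, and let matching checksums flag duplicates. The directory layout and file names below are illustrative.

```python
# Sketch: a minimal file inventory with checksums and a CSV "spreadsheet",
# assuming no repository infrastructure is available.
import csv, hashlib, tempfile
from pathlib import Path

def md5sum(path, chunk_size=65536):
    """Hash in chunks so large files don't have to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def inventory(directory):
    """One row per file: name, size, type, checksum."""
    return [{"name": p.name, "size": p.stat().st_size,
             "type": p.suffix or "unknown", "md5": md5sum(p)}
            for p in sorted(Path(directory).rglob("*")) if p.is_file()]

def find_duplicates(rows):
    """Identical checksums mean identical content."""
    seen, dupes = {}, []
    for r in rows:
        if r["md5"] in seen:
            dupes.append((seen[r["md5"]], r["name"]))
        else:
            seen[r["md5"]] = r["name"]
    return dupes

# Demo on a throwaway directory with one duplicated file.
with tempfile.TemporaryDirectory() as d:
    Path(d, "a.txt").write_bytes(b"same content")
    Path(d, "b.txt").write_bytes(b"same content")
    Path(d, "c.txt").write_bytes(b"different")
    rows = inventory(d)
    with open(Path(d, "inventory.csv"), "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["name", "size", "type", "md5"])
        writer.writeheader()
        writer.writerows(rows)
    print(find_duplicates(rows))  # [('a.txt', 'b.txt')]
```

Re-running the same hashing later and comparing against the stored checksums is also the fixity check from the “is the file corrupted?” bullet above.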
Questions (a selection)
Q: Do you have any rules of thumb for materials NOT to accession?
A: The folks at the University of Virginia have not seen anything that they have decided not to take – nothing too unusual. Make sure you have access to the hardware to read the data/content. For some formats whose software UVA doesn’t actually have, they obtained copies of the software from the donor.
Q: Do you manage the bit-stream or physical for commercially produced materials such as DVDs related to other materials?
A: Only physical management at the moment.
Q: Does UVA’s gift agreement contain language for digital preservation?
A: The agreement does state that the donor agrees not to offer the same content to other sources or institutions, and it provides information about intellectual property rights. UVA reserves the right to do whatever is needed to preserve the content, and donors may ask for access restrictions. It does not contain any statement to the effect that UVA agrees to preserve content in a particular form or for a specific length of time.
Appraisal and accession are CRUCIAL.
Metadata is important – use checksums, spreadsheets.
Consider consortia – have someone else read the disks you can’t, and vice versa.
Last week (one week after the first webinar), I attended the second Introduction to Digital Preservation webinar, hosted by ASERL (Association of Southeastern Research Libraries).
[For more details and to register for the remaining two webinars, see the project webpage]
This webinar was titled: “Forbearing the Digital Dark Age: Capturing Metadata for Digital Objects.” Chris Dietrich, from the National Park Service, presented, with help from Jodi DeRidder and John Berger.
My first impression was that the polls embedded in the presentation slides were pretty neat and convenient, and some day I’m going to learn how to do that.
To get back to the main topic, digital preservation, here are my highlights from the session.
Chris Dietrich described what he meant by the “digital dark age”: the “meantime” while our understanding and preservation strategies catch up to our technology, during which we remain unable to fully capture our content.
What is metadata?
- Is like card catalog information
- Can be managed, shared, preserved
- Is important because digital objects will be with us a long time, and metadata adds value to objects
- With no idea how our digital objects will be used in the future, the more information we provide about these objects, the more valuable they will be
Categories of metadata
- Descriptive – discovery, understanding (i.e. the card catalog information)
- Administrative – management (usually in repositories), e.g. access restrictions
- Structural – storage, presentation, logical/physical components (e.g. HasPart, IsPartOf, IsRelatedTo in Dublin Core)
- Technical – properties of the file itself, instrument settings, e.g. maker and model of camera for photographs, when the object was created
- Other – rights, preservation, geospatial
NB: Chris Dietrich recommended not worrying about whether every piece of metadata fits into its appropriate category – these categories are just guidelines
How metadata is captured
Embedded metadata:
- Advantage: it travels with the file and can be extracted from it
- Not all metadata may be appropriate to leave in the file and/or for certain audiences
- Location is important to pull out before sharing, as it can be sensitive data (redaction can be done automatically when it is logically built into the management system, or it can be done manually)
- Make sure what is in the file is synced with the information in the repository!
Sidecar metadata:
- Usually lives in companion or “sidecar” documents: spreadsheets, XML
- Is not part of the file or object itself
- Advantage: is easily edited in bulk
- Disadvantage: is easily orphaned or misplaced, separated from the original file/object
- Needs to be synced with the repository and the original
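One low-tech way to catch a sidecar falling out of sync is to store the file’s checksum inside the sidecar record, then recompute and compare. This is a sketch: the JSON layout and file names are my own invention, not a standard sidecar format.

```python
# Sketch: detect drift between a file and its sidecar metadata by storing
# a checksum in the sidecar. The sidecar layout here is illustrative only.
import hashlib, json, tempfile
from pathlib import Path

def sidecar_for(path):
    """Sidecar convention assumed here: <name>.<ext>.json next to the file."""
    return path.with_name(path.name + ".json")

def write_sidecar(path, metadata):
    record = dict(metadata)
    record["md5"] = hashlib.md5(path.read_bytes()).hexdigest()
    sidecar_for(path).write_text(json.dumps(record))

def in_sync(path):
    record = json.loads(sidecar_for(path).read_text())
    return record["md5"] == hashlib.md5(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as d:
    photo = Path(d, "scan001.tif")
    photo.write_bytes(b"original bits")
    write_sidecar(photo, {"title": "Field notes, box 3"})
    synced_before = in_sync(photo)    # file and sidecar match
    photo.write_bytes(b"edited bits")  # file changes, sidecar doesn't
    synced_after = in_sync(photo)      # drift detected
```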
Types of photographic files
- JPEG – lower-resolution copy
- TIFF – higher-resolution copy
- RAW – digital negative (most formats are proprietary, these are often in archives, and are used for analysis)
Types of photographic metadata
- EXIF – technical, descriptive
- XMP (Extensible Metadata Platform) – an Adobe standard (descriptive); integrated into JPEGs – is starting to see wider adoption, used in addition to EXIF
- IPTC – subsumed by XMP – originated with photojournalism
Tools to get at embedded metadata
1. Windows Explorer (ubiquitous)
- Performs batch operations (edit titles, keywords, projects)
- Allows metadata input and editing and file renaming (although this function is clumsy)
- Discovery: basic searching
- Limited functionality
2. Proprietary tools: GPS photos
- Batch operations
- Multiple outputs
- Useful for photography with GPS and embedded coordinates
3. Open source/free tools: IExifPro
- View/edit all EXIF metadata
4. Open source/free tools: Windows Live Photo Gallery
- “Prep and publish” software
- Is a free download
5. Source for open source/free/shareware: SourceForge
Photographic metadata standards
- Dublin Core
- Federal Geographic Data Committee (geospatial metadata)
Required elements (at NPS)
- Title – who, what, where, when
- Create date – born-digital or digitized
- Contact info – photographer/steward
- Access constraints – copyright, privacy
- Constraints info – describe the constraints for access, etc.
- Place description – place name
- NPS Unit Info – local
NB: Save and archive masters (original copies) at the highest resolution that is manageable, because increasing the quality/resolution of a lower-resolution photograph later will not work.
- PDF – an uneditable snapshot, which can contain other types of documents and may not be a text document (i.e., sometimes PDFs contain photos or videos)
Dublin Core standards
- Flat, flexible, and easy to use
- For any object type
- Can be imprecise
- Simple and qualified implementations, which can be extended to add more specificity
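A simple (unqualified) Dublin Core record really is as flat as described: one element per statement. Here is a minimal sketch built with Python’s standard library; the wrapper element and the sample values are made up, while the dc:* element names come from the Dublin Core element set.

```python
# Sketch: a flat, unqualified Dublin Core record as XML.
# The <record> wrapper and sample values are illustrative.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("record")
for term, value in [
    ("title", "Oral history interview, 1998"),
    ("creator", "Unknown photographer"),
    ("date", "1998-07-14"),
    ("format", "image/tiff"),
    ("rights", "Copyright undetermined"),
]:
    el = ET.SubElement(record, f"{{{DC}}}{term}")
    el.text = value

xml_out = ET.tostring(record, encoding="unicode")
print(xml_out)
```

The flatness is the trade-off the bullets describe: every statement is a simple term/value pair, easy to produce for any object type but imprecise until you move to qualified Dublin Core.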
Library of Congress TextMD
- Technical metadata for text objects
Document metadata tools
- Windows Explorer
- MS Word – for individual file metadata editing
- Adobe Acrobat – individual file metadata editing; inherits metadata from Word
- Back up
Last week, as part of my career development strategy, I began a series of webinars hosted by ASERL (Association of Southeastern Research Libraries) titled Introduction to Digital Preservation.
[For more details and to register for the remaining two webinars, see the project webpage]
Firstly, I want to say that I love webinars. They’re hosted by many different types of organizations, such as research libraries (as in this case), professional associations, and vendors (as in the case of the Wikispaces webinar I attended recently), among others. They are often free, or cheap. And I always find them practically educational, meaning they provide practical tips and tricks for information professionals (and others).
Secondly, I took some notes during the first webinar in the series, and what follows are the highlights. In future posts, I’ll do the same for the subsequent webinars.
The first webinar was titled “Preservation Planning and Overview of PREMIS for Beginners.” Lisa Gregory of the State Library of North Carolina presented, with Jodi DeRidder and John Berger.
There was quite a lot of information to digest in this session, and I think I’m still digesting the second half, about PREMIS. But it started with a quick definition of preservation and the essential, interwoven, steps for planning digital preservation initiatives.
Preservation is the active management of digital content over time.
- Manage technology that keeps content accessible
- Problem: developing strategies in isolation. Some resources for keeping up with the community:
- The Signal
- Twitter (search for “digital preservation” and follow industry innovators)
- D-Lib Magazine
- Empty the institution’s “pockets”: Where is all the institution content shared/saved/kept?
- Preservation statements: see http://www.dlib.org
- SCORE – log in as a guest; will need knowledge of digital preservation language; statement for individual/institution
- TRAC – measures repositories against OAIS standards; assesses “trustworthiness” in three areas: organizational infrastructure; digital object management; and technologies, technical infrastructure, and security
- Planning, best practices
- Write down the proposal for stakeholders
- Write down the results of the assessment
- Write down the workflows you articulate
- Be transparent
- Bring in as many of the involved parties as is feasible.
- Review it!
- Preservation Policy Template (MetaArchive)
- Digital Preservation Management workshop for management aspects of digital preservation, including administrators, policies, procedures
- Keep an open dialog
- It is less risky to bring people in on the front end, than it is to bring them in at the end.
- Obsolete media
- Does what you take in match your resources?
- Is it organized according to a system you’ve defined?
- Do you have enough resources to manage all content?
- Consider security, access, permissions.
- Do you have the staff expertise and equipment to extract or work with the data?
- Can you determine if the content is relevant to the institutional/organizational scope?
- How do you position yourself for administrative buy-in?
- What could reasonably be improved within 3-5 weeks?
- Are there other stakeholders headed your way?
- What small improvements can be made?
- What can be done to start recording and improving preservation metadata?
- Is there a data dictionary or guidance document for that metadata?
OAIS Reference model
- Is ubiquitous in digital preservation language
- Is a way to describe, in a non-software-specific way, the essential functions of a digital preservation management system
- Is a hefty, dense document
- Can be studied in scholarly and professional articles about digital preservation
One attendee asked how one knows when one has enough information. Lisa Gregory and Jodi DeRidder both suggested doing some research and then getting started: talk to similar institutions, since it’s not feasible to wait until one has all the answers. Start by inventorying assets, defining the scope, doing bit-level storage, and making copies. Get buy-in at every step, one step at a time.