Congratulations HathiTrust- News from Bibliographic Wilderness

“Big news, HathiTrust wins the lawsuit brought against them by the Author’s Guild (yeah, the same association that is the plaintiff in the in-limbo case against Google Books).”


The Dangers of Automation in Digital Projects

With digital stuff, it’s easy to think that things can be automated.  You’re dealing with data and XML.  You’re dealing with computers and scanners.  The temptation is to write scripts that process batches for you.  To a certain extent, this works.  For image processing, for example, batches are great.

The problem comes in metadata.  We have a large collection of digitized books that has been moved from system to system automatically for years.  Mostly, it’s because at different times we put stuff in different places, and then decided to bring the collection together a little at a time.

The end result of all this automated data manipulation and fudging to make things fit?  The collection is a Frankenstein’s monster as far as the metadata is concerned.  Same data in different fields, multiple fields with duplicate data, missing data.  This causes a huge problem for metadata harvesters.

Why are metadata harvesters so important?  Well, we’ve realized that 88% of the people who visit our collections come from external systems that harvest our metadata.  The top referrer is Google; the next is our federated search.  The other 12% use the native DSpace interface, and most of that 12% is us, adding and working on collections.  So most of our users rely on harvested metadata to find our stuff.  If our metadata isn’t harvestable, or is difficult to harvest, or produces confusing results for the patron, then the central means of discovery for our collections is useless.

We have two student assistants doing nothing but going through and making the data homogeneous.  We can do this now because the collection is still a manageable size.

So be protective of your item-level metadata.  Make sure things are consistent and follow the guidelines for OAI-PMH metadata harvesting.
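The kind of cleanup our students are doing can be partly scripted. Here is a minimal sketch of normalizing inconsistent records before exposing them to harvesters; the field names, the alias mapping, and the record are made-up examples, not our actual DSpace schema.

```python
# Sketch: normalize inconsistent metadata records before OAI-PMH exposure.
# The field names and alias mapping below are hypothetical examples.

FIELD_ALIASES = {
    "dc.creator": "dc.contributor.author",  # same data, different field
    "author": "dc.contributor.author",
    "dc.date": "dc.date.issued",
}

REQUIRED_FIELDS = ["dc.title", "dc.contributor.author", "dc.date.issued"]

def normalize(record: dict) -> dict:
    """Map stray field names onto canonical ones, collapse duplicate
    values, and flag required fields that are still missing."""
    clean = {}
    for field, values in record.items():
        canonical = FIELD_ALIASES.get(field, field)
        merged = clean.setdefault(canonical, [])
        for v in values:
            if v not in merged:          # drop duplicate values
                merged.append(v)
    clean["_missing"] = [f for f in REQUIRED_FIELDS if not clean.get(f)]
    return clean

record = {
    "dc.title": ["Annual Report 1912"],
    "author": ["Smith, J."],
    "dc.creator": ["Smith, J."],         # duplicate of the author field
}
print(normalize(record)["_missing"])     # the date field was never filled in
```

A pass like this won’t replace human review, but it turns “make everything homogeneous” into a report of exactly which records still need hands-on work.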

A misconception about digital collections

I’ve noticed a common misunderstanding about digital collections, specifically about born-digital collections versus digitized collections.

The assumption is that when you scan something, the JPEG (or PDF, or other similarly small file) you create and put online is the digital item.  But the display file has usually been reduced in quality.  That means the original scans are probably higher quality, and those are the ones you want to protect.

For born-digital stuff, the assumption is that the born-digital object is the thing to be preserved, but these objects are often in the same formats as the display versions of digitized material.  PDFs are not a great archival format: if part of the file gets corrupted, the whole file can become unreadable.  JPEGs are notorious for losing information every time the file is edited and re-saved.

A better method would be to have a policy that every born-digital PDF gets exported into TIFF files for archiving, and every JPEG has a TIFF equivalent made for archiving.
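That policy can be sketched as a small script: for every display file, decide what archival TIFF it should have and list the ones still missing. The directory layout here is a made-up example, and the actual pixel conversion (Pillow for JPEGs, or an external tool such as ImageMagick or Ghostscript for PDFs) is left as a stub.

```python
# Sketch: plan archival TIFFs for display files. Layout is hypothetical.
from pathlib import Path

DISPLAY_ROOT = Path("display")   # assumed location of web-facing files
ARCHIVE_ROOT = Path("archive")   # assumed location of preservation masters

def archival_path(display_file: Path) -> Path:
    """Mirror the display tree under the archive root, as .tif."""
    relative = display_file.relative_to(DISPLAY_ROOT)
    return (ARCHIVE_ROOT / relative).with_suffix(".tif")

def plan(display_files):
    """Return (source, target) pairs for files that lack a TIFF yet.
    The actual conversion step would go where each pair is consumed."""
    return [(f, archival_path(f)) for f in display_files
            if not archival_path(f).exists()]

# e.g. display/coll1/item042.pdf -> archive/coll1/item042.tif
print(archival_path(Path("display/coll1/item042.pdf")))
```

Keeping the plan separate from the conversion means you can re-run it as an audit: any display file without a matching TIFF is a gap in the policy.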

I would like it if vendors stopped assuming I just want to preserve whatever is being displayed to the public.  If they made a system that would automatically extract TIFFs and move them to an archive, I’d be much happier.

Digitization Equipment- see it in person first

With the new fiscal year coming around, our lab is looking at new equipment. We are looking at grants to fund a large format scanner, and looking for future equipment we might want.

With high-end digitization equipment, it’s always best to see the equipment in person. This is especially true if the equipment has “automatic” functions of any type. You want to get a feel for how big, how easy, and how fast the equipment actually is. Stats can only tell you so much. Demos are good in a pinch, but it’s really best to see the equipment in real working conditions, when things are not perfect. This also means you get a chance to talk to someone who’s used the machine a bit, and they’ll tell you its quirks.

So, when you’ve narrowed your search down to a few machines, it might be worth the few thousand dollars to send someone to another organization to see the equipment in person. Vendors will usually give you a list of people they’ve sold their machines to.

If you can’t see the equipment in person, a good phone call to someone who works with the equipment all the time can suffice. Let them talk as much as they want. The best phone calls and emails about equipment covered the same topics. Here’s a quick guide to asking about equipment:

  • Tell them what you are planning to do. What does your project look like? What do you have to scan? What’s the end product?
  • Tell them what equipment you are looking into, and why you are contacting them. You may be calling about scanner X, but mention that you’re also looking at scanner Y, because they may have knowledge of scanner Y.
  • Ask if, in their professional opinion, the equipment is right for your project.
  • Ask what problems they have had with the equipment.
  • Ask what problems they have had with the company (warranty, maintenance, etc.).
  • Ask what the equipment’s limitations are.
  • Ask what the cycle time for the equipment is (How long does it take to complete an item).
  • Ask them if they have any other recommendations for equipment.

And ask any other questions you may have. Let the person talk as long as they want. What they will probably give you is context. Their digitization workflow runs a certain way that will probably be different from yours. You need to know how their process differs so you can tell whether something works well for them because of their setup rather than because of the equipment.


Digital Public Library of America- Digital Hubs Pilot Project

The DPLA is starting its Digital Hubs Pilot Project. It will be setting up hubs to ingest content for different regions. I hope that, by the time they get to Texas, they choose the Texas Digital Library as our regional hub.

“Each Service Hub will offer a full menu of standardized digital services to local institutions, including digitization, metadata, data aggregation and storage services, as well as locally hosted community outreach programs bringing users in contact with digital content of local relevance. The two-year Hubs Pilot aims to help existing state programs offer these services to all institutions in their state or region. Service Hubs will serve as an on-ramp for every institution in a pilot state or region to participate in the DPLA network. We hope that this model proves successful and that it will ultimately exist in all US states or regions.”

Digital Preservation, the library’s mission?

In academic environments, IT services may be focused on computing power and security. In my experience, it’s very hard to get IT people to understand the concept of Digital Preservation beyond simple backups. In this kind of environment, we are taking the stance that it’s the library’s mission to champion digital preservation to the rest of campus.

I wonder how many other academic libraries are in the same position. They realize their IT resources aren’t giving them what they need, so they start looking into how to do digital preservation themselves, and end up setting a model for the rest of campus.

At CNI this past year, James Hilton reported that he could locate only 13% of the research done the previous year at his institution. It was just one institution, but I wouldn’t be surprised if others are in the same boat. In ten years, will the research being done now still be available? I would hope libraries step up and start dark archives alongside their institutional repositories. That’s the route we are going. If faculty don’t want to publish their work, they can at least let us preserve it for future generations.

A Challenge of Digital Preservation

In my experience trying to get a digital archive set up, one of the greatest difficulties is expressing our needs to our IT people. The idea of a digital archive that performs most of the functions of a physical archive, only digitally, is very different from everything else IT has to deal with.

We say we need storage.  They ask how much.  We say 100 TB.  They ask how big the files are and how much compression they can do.  We tell them we can’t compress the files.  They look like someone has slapped them in the face.  How often do the files get used, they ask.  The answer is they might never be used again, but we need to continuously verify that they have not degraded, and we may eventually need to access the files.  This also seems to confuse them.
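The “continuously verify that they have not degraded” part is the piece IT finds strangest, and it’s worth showing concretely: it means periodic fixity checking against checksums recorded at ingest. Here is a minimal sketch; the manifest format is invented for the example (real systems record fixity in something like PREMIS metadata).

```python
# Sketch: fixity audit. Compare each file's current SHA-256 against the
# checksum recorded when it was ingested. Manifest format is hypothetical.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large TIFFs don't have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MB at a time
            h.update(chunk)
    return h.hexdigest()

def audit(manifest: dict) -> list:
    """manifest maps file path -> checksum recorded at ingest.
    Returns the paths whose current checksum no longer matches."""
    return [path for path, recorded in manifest.items()
            if sha256_of(Path(path)) != recorded]
```

The point for the IT conversation: the files are “never used,” yet every byte still has to be read on a schedule, which is an I/O load that a plain backup system doesn’t plan for.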

Add the fact that all this is incredibly expensive, and the conversations only get worse.  Our organization needs about $50k a year just to keep up with storage for the archive of our digital collections.  That’s a lot to ask of IT resources that are being pulled in a thousand other directions.
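A figure like that is easy to reconstruct as back-of-the-envelope arithmetic; every number below is an assumed example for illustration, not our actual pricing.

```python
# Back-of-the-envelope storage cost. All figures are assumptions.
TB_NEEDED = 100          # uncompressed masters, from the 100 TB above
COPIES = 2               # e.g. primary copy plus a dark-archive replica
COST_PER_TB_YEAR = 250   # assumed managed-storage rate, in $/TB/year

annual_cost = TB_NEEDED * COPIES * COST_PER_TB_YEAR
print(f"${annual_cost:,}/year")   # $50,000/year at these assumptions
```

Walking IT through the multiplication (size, times replicas, times a managed rate, with no compression in the denominator) does more to explain the bill than the total ever does.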

The trick is to make sure you know exactly what you need.  It’s hard, because there are thousands of different options for servers and some are cheaper than others.  The idea is to have a firm idea of what your digital archive needs to do.  Express those fundamental requirements, and the discussion should go a lot better.

We didn’t have our fundamental requirements until after we had talked with our server people for months.  Most of those meetings were spent discussing our options and clarifying what was important to us.