I suspect there will be a whole series of posts with the subtitle “A Manifesto” as I browse through my old emails, grad school papers, etc. and want to get the word out. I’ll try to give them all the category of “LibrarianshipManifesto” so they are easy to find!
Intro: The question arose whether I could get better technology to make my everyday work with big (messy, legacy) data more efficient. The initial ask was for more RAM, because my boss had gotten that and it worked. The initial response was, “can’t you just do this other thing instead?” A broader audience then considered providing the same technology other big-data folks have, with the matching price tag. The payer’s response was, “that’s exorbitant. Can’t you just do that other thing I suggested?” I drafted this manifesto in response. It is edited a tiny bit to anonymize, but I kept the pertinent details that should make sense to those in my field of practice.
Here is what I am trying to do and the steps involved. It is critical to note that, in the face of a data migration where everything will need to move in large quantities, I will be far from the only person who could benefit from the right technology being available.
- Each month, we get a full draw of the bibliographic records in the Sierra database from IT. It comes in 7 or 8 .mrc files, divided according to the Sierra record number. That takes some time.
- Using MARCEdit’s MARCJoin function, I create a complete .mrc file of the entire bibliographic set that comes to about 6 gigs. That takes some time.
- That complete 6 gig .mrc file is uploaded to a vendor via FTP for access. That takes some time.
- I can use MARCEdit’s MARCBreaker to transform the complete .mrc file into a .mrk file. That takes a significant amount of time.
- I can have MARCEdit sort the contents of the .mrk file according to such data points as:
- control number – MARC 001, the data point necessary for every one of our tens of thousands of eResource overlays, and the one Sierra relies on to prevent duplicate records;
- title – the other one that Sierra uses for preventing duplicate records;
- author; and
- call number.
That takes a significant amount of time, and having MARCEdit sort, then validate (MARCValidator), and then save the sorted file has proven impossible.
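For what it’s worth, the join-and-sort step itself does not require holding the full 6 gig file in an editor. Below is a minimal sketch in plain Python, using only the standard library, of joining several .mrc files (binary MARC records are simply laid end to end, which is what MARCJoin produces) and sorting the combined set by the 001 control number. This is not MARCEdit’s implementation; the function names are mine, but the MARC 21 structure it relies on (24-byte leader, 12-byte directory entries, 0x1E field terminators) is standard.

```python
def split_records(blob: bytes):
    """Yield individual MARC records from a concatenated .mrc byte stream."""
    pos = 0
    while pos < len(blob):
        length = int(blob[pos:pos + 5])  # leader bytes 0-4: record length
        yield blob[pos:pos + length]
        pos += length

def control_number(record: bytes) -> str:
    """Return the 001 field of a binary MARC record, or '' if absent."""
    base = int(record[12:17])        # leader bytes 12-16: base address of data
    directory = record[24:base - 1]  # directory entries, minus the 0x1E terminator
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]  # 3-byte tag, 4-byte length, 5-byte offset
        if entry[:3] == b"001":
            length = int(entry[3:7])   # field length, including its 0x1E
            start = int(entry[7:12])   # offset from the base address
            return record[base + start:base + start + length - 1].decode()
    return ""

def join_and_sort(blobs) -> bytes:
    """Join several .mrc byte strings and return the records sorted by 001."""
    records = [r for blob in blobs for r in split_records(blob)]
    records.sort(key=control_number)
    return b"".join(records)
```

On a real 6 gig set one would stream records to disk rather than sort an in-memory list, but the point stands: sorting by a fixed field is a mechanical operation, not one that should be gated on a single desktop application surviving it.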
Yes, it is possible to do the same with the smaller, original .mrc files that come from IT. But they are still split according to Sierra’s record number and not any other data point. Yes, it is possible to sort each of them separately. But then they need to be compared side by side to find the desired or duplicated records across all of the files. It is also possible to transform the whole thing into a .csv, but working with that many rows of a spreadsheet is not much easier.
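If the .csv route were taken, it would at least be manageable if limited to just the data points of interest rather than every field. A sketch of that, assuming MARCEdit’s mnemonic .mrk format (records separated by a blank line, fields on lines like `=001  ocm111`) and using only the standard library; the function names and sample tags are mine:

```python
import csv
import io

def mrk_records(text: str):
    """Yield records (as lists of lines) from .mrk text, blank-line separated."""
    block = []
    for line in text.splitlines():
        if line.strip():
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block

def field(record, tag: str) -> str:
    """Return the content of the first field with the given tag, or ''."""
    prefix = f"={tag}  "
    for line in record:
        if line.startswith(prefix):
            return line[len(prefix):]
    return ""

def mrk_to_csv(text: str, tags=("001", "245")) -> str:
    """Write one CSV row per record, one column per requested tag."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(tags)
    for record in mrk_records(text):
        writer.writerow([field(record, tag) for tag in tags])
    return out.getvalue()
```

A two-column control number/title export of even 600,000+ records is a far smaller object to sort and filter than the full spreadsheet.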
Sorting the complete .mrk file by control number, for example, is what made it possible to get started on reconciling missing or blatantly incorrect control numbers. It is also desperately needed to track down all of the unexpected, incorrect overlays that happened over the last decade. But without being able to save the file in that state, I have to wait for it to sort from scratch each time. The set of records I need to work with to determine what’s been done, what’s changed over time, what needs to be done, and how to standardize the overall processes for loading electronic resources is significantly larger than what the only other way to collocate records can handle: the “Create Lists” function in Sierra (its limit is 500,000 records, and the full eResource collection is likely over 600,000).

Sorting by title will be helpful for finding multiple records for the same title that use different control numbers and clearing that up pre-migration. Sorting by author will allow us to update and correct name access points to make it possible for our users to find everything by a given author (Cutter, 1876). Every ILS we are considering, and every possible discovery layer we could put on top of it, has a “virtual shelf browse” in the user display, which relies on the call number; sorting the complete .mrk file by call number will allow us to facilitate that feature effectively for users looking for content by subject area (also Cutter).
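The same-title/different-control-number problem, in particular, is easy to surface once the control number and title pairs have been extracted: group by a normalized title and flag any title that appears under more than one distinct 001. A sketch, assuming the pairs have already been pulled out of the records; the normalization here (case and whitespace only) is deliberately crude and would need refinement for real cataloging data:

```python
from collections import defaultdict

def duplicate_titles(pairs):
    """Given (control_number, title) pairs, return the normalized titles
    that appear under more than one distinct control number."""
    by_title = defaultdict(set)
    for cn, title in pairs:
        key = " ".join(title.lower().split())  # crude normalization
        by_title[key].add(cn)
    return {t: sorted(cns) for t, cns in by_title.items() if len(cns) > 1}
```

The output is exactly the pre-migration worklist described above: each entry is one title whose duplicate records need to be reconciled before the data moves.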
We have a choice: put in the effort and time/labor now to give the new ILS better data, or put in more effort and time/labor later to fix not only the data problems inherent to a data migration but also the data problems we could have prevented. Data creation and management in libraries is not just about proofreading and data entry! It’s about enabling discovery and access, and fixing them when they break.
Disclaimer:
All words and images are my own. If they are not, they are cited as such to give proper attribution to the intellectual property owners.
No words or images reflect the opinions or viewpoints of my current, former, or future employers and educational institutions. They are from my own viewpoint.
