Big Data Transfers - Sharing Multi-terabyte Image Collections
Instructors: Doug Peterson, Pixel Acuity; Isabel Meyer and Asher Akhtar, Smithsonian Institution
Duration: 4 hours, with a half-hour break from 12:15 to 12:45 and two 15-minute breaks
Course Date: Thursday 17 June
New York: 10:00 - 15:00
Paris: 16:00 - 21:00
Prerequisites: Some basic experience with Python and a reasonably fast internet connection (10 Mbps up/down), as we will be downloading and uploading a few hundred megabytes of files during the hands-on section of the class. For those without prior Python experience, we suggest watching the free DT Coding Series, Part 4: Getting Started with Python.
This course enables the attendee to:
- Understand three common methods for facilitating requests to transfer large batches of images into or out of an institution.
- Deepen existing programming experience with detailed implementation knowledge.
- Understand the range of options and their pros and cons, which will help those without programming experience navigate this issue with their institution's IT department.
Intended Audience: IT professionals at cultural heritage institutions who facilitate requests from researchers, vendors, or other institutions to receive very large file deliveries, for example, gigabytes or terabytes of images or movies. Non-IT staff who receive or send such requests will also benefit, as the course familiarizes them with the range and pros/cons of the available options.
Course Description: Big Data projects often require Big Data transfers, and in the era of COVID these often must be accomplished without being on-site. Pixel Acuity and the Smithsonian are working on an AI project that required the transfer of several terabytes of images and metadata. We explored three methods: Dropbox, a web API, and an Amazon S3 bucket.
In the first hour, we present our findings. We start by discussing why such data transfers may be requested and provide a brief overview of relevant technology and terminology. We then review the technical and practical details of each data transfer method we evaluated, covering the entire process: from considerations in preparing large image volumes for export from a digital repository, through the export process itself, to options for confirming the completeness of the transfer.
In the second hour, we do hands-on walk-throughs of the following components of these methods:
- Syncing data to an Amazon S3 bucket using the Transmit app
- Writing Python to download images from the Smithsonian and NYPL web APIs
- Writing Python to remotely validate MD5 checksums of images in an S3 bucket
- Using CloudSync to transfer images from FreeNAS™ file storage to Amazon S3
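The API-download walk-through builds scripts along the lines of the following minimal sketch, which streams each file to disk so large images never sit fully in memory. The URL shown is hypothetical; the actual Smithsonian and NYPL endpoints, API keys, and response formats are covered in the hands-on session.

```python
import os
import shutil
from urllib.request import urlopen

def filename_from_url(url):
    """Derive a local filename from the last path segment of a URL."""
    return url.rstrip("/").rsplit("/", 1)[-1]

def download_images(urls, dest_dir):
    """Stream each URL to a file in dest_dir, copying in chunks."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in urls:
        dest = os.path.join(dest_dir, filename_from_url(url))
        with urlopen(url) as resp, open(dest, "wb") as out:
            shutil.copyfileobj(resp, out)

# Example usage (hypothetical URL):
#   download_images(["https://example.org/images/specimen-0001.jpg"], "downloads")
```

A production version would add retries, a timeout, and error logging, but the chunked-streaming pattern is the core of the class exercise.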
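The checksum-validation walk-through relies on the fact that, for single-part uploads, an S3 object's ETag is its MD5 digest. A minimal sketch of that comparison follows; the bucket and key names in the usage comment are hypothetical, and multipart uploads produce composite ETags that need different handling.

```python
import hashlib

def local_md5(path, chunk_size=1024 * 1024):
    """MD5 hex digest of a local file, computed in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def etag_matches(etag, md5_hex):
    """Compare an S3 ETag header value to a local MD5 hex digest.

    S3 returns the ETag wrapped in double quotes. Multipart uploads yield a
    composite ETag ("<md5-of-part-md5s>-<part count>"), which is not a plain
    MD5 and cannot be compared this way.
    """
    etag = etag.strip('"')
    if "-" in etag:
        raise ValueError("multipart ETag; not a plain MD5 of the object")
    return etag == md5_hex

# Usage against a live bucket (requires boto3 and AWS credentials;
# bucket and key names here are hypothetical):
#   import boto3
#   etag = boto3.client("s3").head_object(
#       Bucket="my-transfer-bucket", Key="images/0001.tif")["ETag"]
#   ok = etag_matches(etag, local_md5("images/0001.tif"))
```

Because the comparison uses only a HEAD request, it validates a transfer without re-downloading the objects.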
Doug Peterson is the head of research and design at Digital Transitions, which includes DT Heritage (digitization hardware) and Pixel Acuity (digitization services). He has a BS in commercial photography from Ohio University and has been programming since 6th grade. He is the lead author of the DT Digitization Guide series and has previously presented two short courses at the IS&T Archiving Conference.
Asher Akhtar is a Backup/Storage Engineer at the Smithsonian, where he has worked full time since April 2014. His primary responsibilities include architecting and managing the Smithsonian's data backup environments. Akhtar has been a Unix admin, storage admin, and backup admin for over 20 years at various organizations.
Isabel Meyer is the versatile branch manager responsible for the Smithsonian Institution's Enterprise Digital Asset Management System (DAMS). She joined the Smithsonian's Office of the Chief Information Officer in 2003 with more than 20 years of proven leadership and experience in the technology and digital media industry. She combines technical knowledge with her skills in building and strengthening relationships across all levels of an organization to achieve consensus and deliver solutions in complex environments.
Member $135
Non-Member $150
Student $75