Import/Export Procedures
6/15/2001
This section outlines the procedures used by the AMICO
staff to gather and process all of the data to be distributed annually
as part of The AMICO Library.
Unless otherwise noted, scripts are found in /amico/bin/
1.1) Receiving/Validating Images
AMICO member institutions who are contributing works
for the coming AMICO Library year are instructed to send, by a fixed
deadline, media containing the image and multimedia files for those
works. CDs are the preferred media, but other forms of media are acceptable
if first approved by the AMICO technical staff.
When media is received by AMICO, the following procedure
is followed:
- An acknowledgement email is sent to the contacts for the member
institution.
- The media is physically labeled according to that member's AMICO
member prefix, in the following form: MEMB.YEAR.MEDIA# (e.g. LACM.2001.01).
The date received is also written on the media itself.
- The media and date received are logged on paper.
- As early as possible, the files from the media are loaded onto the
server into a dir named after the media in /amico/scratch.
- Command for mounting a CD to an empty dir DIR:
mount /dev/cdrom DIR
- The files are then examined to be sure that they are:
  - Named correctly: files must be prefaced by the member prefix and
    not contain any illegal characters
  - Correct format: files must be TIFFs
- acknowledgeMedia.pl is run on the files. This script creates a row
in the media table for this media and a row in the file table for each
TIFF file contained on the media, and thumbnails the images. Thumbnails
are stored automatically in /home/httpd/html/amico/apw/thumbnails/MEMB.
- The script is invoked as:
acknowledgeMedia.pl MEMB MEDIANAME SOURCEDIR comments
  MEMB is the 4-char member prefix
  MEDIANAME is the label given to the media as outlined above
  SOURCEDIR is the working dir in /amico/scratch where the media is
  currently stored
  comments must be in quotes and contain any special notes regarding
  that media (e.g. media is a replacement for an earlier submission,
  or filenames were changed)
Once the acknowledgeMedia.pl script has run, there will be a media log
entry for that media in the CMS, under View Files->View Media Log.
Members are asked to view these logs to verify that image orientation
and quality meet expectations.
- If metadata creation has been requested by the member, a custom script
for that member is made based on the mkmeta.base.pl template script,
with the XPU, XRI, and XRS set to the values supplied by the member.
The script is then run in the dir for that media on the server:
mkmeta.MEMB.pl > MEDIANAME.mkmeta.log
The log (MEDIANAME.mkmeta.log) will indicate whether the script was
successful. The resulting output.meta.txt file is then given a name
appropriate to that media, moved to /amico/work-space/incoming/MEMB,
and validated through the CMS.
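The file checks described above (member prefix, legal characters, TIFF format) could be sketched as follows. This is only an illustration, not one of the AMICO scripts: the set of legal characters is an assumption, since the procedure does not list the illegal ones, and the function names are hypothetical. The two 4-byte TIFF signatures are the standard ones.

```python
import re

# Assumption: "legal" characters are ASCII letters, digits, dot, underscore
# and hyphen. The procedure itself does not list the illegal characters.
LEGAL_NAME = re.compile(r"[A-Za-z0-9._-]+")

# Every TIFF file starts with one of these two 4-byte signatures.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")


def name_problems(filename, memb):
    """Return a list of naming problems for one file (empty if none)."""
    problems = []
    if not filename.startswith(memb):
        problems.append("missing member prefix " + memb)
    if not LEGAL_NAME.fullmatch(filename):
        problems.append("illegal characters in name")
    if not filename.lower().endswith((".tif", ".tiff")):
        problems.append("extension is not .tif/.tiff")
    return problems


def looks_like_tiff(first_bytes):
    """Check the first four bytes of a file against the TIFF signatures."""
    return first_bytes[:4] in TIFF_MAGIC
```

Checking the signature as well as the extension catches files that were merely renamed to .tif.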
This procedure is repeated for all media received until
it is time for export processing.
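Because the media label doubles as the dir name in /amico/scratch, it helps to build and check it mechanically. A minimal sketch of the MEMB.YEAR.MEDIA# form follows; the two-digit media number and the character set allowed in the prefix are assumptions drawn from the LACM.2001.01 example, and the function names are hypothetical.

```python
import re

# MEMB.YEAR.MEDIA#, e.g. LACM.2001.01. Assumption: the prefix is four
# letters, digits or underscores and the media number has two digits.
LABEL = re.compile(r"([A-Za-z0-9_]{4})\.(\d{4})\.(\d{2})")


def make_label(memb, year, number):
    """Build a media label from its three parts."""
    return "%s.%04d.%02d" % (memb, year, number)


def parse_label(label):
    """Split a media label into (memb, year, number) or raise ValueError."""
    m = LABEL.fullmatch(label)
    if m is None:
        raise ValueError("not a MEMB.YEAR.MEDIA# label: " + label)
    return m.group(1), int(m.group(2)), int(m.group(3))
```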
1.2) Receiving/Validating Multimedia
1.2.1) Member-Submitted Multimedia
Multimedia is received from members in the same form
as images, but the procedure is different as there is no program in
place to validate or thumbnail multimedia files. When a multimedia
file is found on submitted media, that file is played to verify that
the file is not truncated or corrupted. The file is then copied to
/amico/data/multimedia. When the time for export arrives, all of the
multimedia files are checked for valid metadata records and those
with valid metadata are moved to /amico/data/totape.
1.2.2) Antenna Audio Files
Antenna Audio files are sent on media directly from
Antenna to AMICO. When the media are received, they are labeled similarly
to the other media, but with added notation to indicate that they
are Antenna media (ex. MIA_.AA.2001.01). Files are matched to catalog
records based on lists provided by Antenna and annotated by the members.
Next, the files are sampled, appropriately named, and recorded in
the file database and saved in /amico/work-space/incoming/MEMB/audio/.
At this point metadata is created by the sampling process populated
with data from the sampled file and the Antenna/member list. These
files can now be reviewed by members with the Antenna tool in the
CMS. This tool allows the member to listen to the sampled file, view
the catalog record, and create the related multimedia description.
Once members have reviewed all of these elements, they must accept
or reject each link.
If approved, a Related Multimedia Group (RMG) is added
to the catalog record using the values entered by the member. The
status row in the file table for that audio file is changed to Approved
and the file is moved from /amico/work-space/incoming/MEMB/audio/
to /amico/data/totape.
If rejected, no changes are made to the catalog record,
the status row in the file table is changed to Deleted,
and the file is immediately moved to /amico/data/MEMB/DELETED, so it
will no longer appear in the Antenna linking tool.
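The two outcomes of a member's review can be summarized as a small decision function. This is a sketch of the rule as described above, not the CMS code; the function name and the example filename are hypothetical, and the actual move and database update are left out.

```python
import posixpath

INCOMING = "/amico/work-space/incoming/%s/audio"


def review_link(memb, filename, approved):
    """Return (new_status, source_path, destination_path) for one audio file."""
    source = posixpath.join(INCOMING % memb, filename)
    if approved:
        # Approved files go straight to the tape staging dir.
        return "Approved", source, posixpath.join("/amico/data/totape", filename)
    # Rejected files are parked in the member's DELETED dir.
    return "Deleted", source, posixpath.join("/amico/data", memb, "DELETED", filename)
```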
1.3) Receiving/Validating Data
Data for submissions can be received in one of two ways:
- AMICO data text records following the AMICO data specification, or
- AMICOlite tab-delimited text files.
Proper submission procedure for either of these file
types is to upload via the File Submission menu of the CMS, using
either the browser or ftp. Browser upload is preferred to ftp because
it ensures that the files land in the correct directory.
1.3.1) AMICO data files
Data files, as described in the AMICO data specification
on the AMICO website (http://www.amico.org/AMICOlibrary/dataspec.html),
can be validated by the member via the View Files or Validate Text
files options in the CMS. If AMICO staff choose to validate or revalidate
these files, the CMS routines are also used.
The browser calls the following script:
The script pulls tags from the data file, separated
by the }~ delimiter. Before doing anything, the script checks whether
or not the AID tag matches any existing AID in the database. If there
is a match, it is assumed to be a record that is being updated. Data
pulled from the data file will overwrite any preexisting data in the
database for a given tag in that record. AIDs that do not match any
existing AID in the database are assumed to be new data and new rows
will be created in the database as the data from the tags is loaded.
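The update-or-insert rule described above can be illustrated with an in-memory stand-in for the catalog table. This sketch assumes the record has already been parsed into a tag-to-value mapping and that tags absent from the file are left untouched on update; both points are assumptions, and load_record is a hypothetical name, not the CMS script.

```python
def load_record(database, record):
    """Load one parsed record (a tag -> value dict) into `database`.

    `database` is a dict keyed by AID that stands in for the catalog table.
    Returns "updated" or "inserted".
    """
    aid = record["AID"]
    if aid in database:
        # Existing AID: overwrite only the tags present in the incoming
        # record; tags not mentioned keep their stored values.
        database[aid].update(record)
        return "updated"
    # Unknown AID: treat as new data and create a fresh row.
    database[aid] = dict(record)
    return "inserted"
```

One consequence of this rule is that a mistyped AID silently creates a new record instead of updating the intended one.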
Records that are loaded have the following information
added to them:
- Warning/Error Messages: APD field
- Catalogstatus: user and timestamp
Validated files are logged in the file table
with the following entries:
- filename
- datevalidated (timestamp)
- size (in bytes)
- type (Text in this case)
- status
Resubmitted files must be renamed to prevent
confusion with database records of previously validated files, even
if the previous file did not validate successfully.
If a member has problems validating a file that are not
attributable to the data itself, AMICO staff examine the file. Common
problems are missing newline characters, characters from other platforms
(e.g. DOS and Mac newlines), and missing or broken delimiters.
Files not yet validated are stored in /amico/work-space/incoming/MEMB/
and validated files are stored in /amico/work-space/saved/MEMB. Once
corrected, the file is revalidated and the member is informed via email
of the changes made.
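Foreign line endings are the easiest of these problems to detect and repair automatically. The following sketch counts each kind of line ending in the raw bytes of a file and converts DOS (CRLF) and old Mac (lone CR) endings to Unix newlines; the function names are hypothetical and this is not part of the existing tool set.

```python
def newline_report(data):
    """Count DOS (CRLF), Mac (lone CR) and Unix (lone LF) line endings."""
    crlf = data.count(b"\r\n")
    return {
        "dos": crlf,
        "mac": data.count(b"\r") - crlf,
        "unix": data.count(b"\n") - crlf,
    }


def to_unix_newlines(data):
    """Convert CRLF and lone CR line endings to LF."""
    # CRLF must be replaced first so its CR is not converted separately.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```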
Validated records are stored in the database and can
subsequently be viewed and edited with the CMS editing tools.
1.3.2) AMICOlite data files
AMICOlite files are tab-delimited text files constructed
as described in the AMICOlite specification on the AMICO members website
(http://members.amico.org/comm/tech/AMICOlite.htm). The specification
must be followed closely, as field
order is very important in these files. Members must inform AMICO
staff of submission of AMICOlite files, as these cannot be validated
interactively. These files must be run by staff through the parseAMICOlite.pl
script to convert tab delimited files to the AMICO Data Specification:
parseAMICOlite.pl FILE > NEWFILE
The new file, which is an AMICO data text file, is
then examined to be sure that the tags and data have been aligned
correctly. It is then moved to the member's incoming
dir and validated through the CMS.
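The reason field order matters is that a tab-delimited line carries no tag names, so each value is assigned to a tag purely by position. The sketch below shows that mapping. The column order used here is hypothetical and for illustration only (the real order is defined by the AMICOlite specification), and the function is not parseAMICOlite.pl.

```python
# Hypothetical column order, for illustration only. The real order is
# defined by the AMICOlite specification.
FIELD_ORDER = ["AID", "ALY", "OTY"]


def parse_amicolite_line(line):
    """Map one tab-delimited line to a tag -> value dict by position."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(FIELD_ORDER):
        # A missing or extra tab shifts every later value to the wrong tag,
        # so refuse the line rather than guess.
        raise ValueError(
            "expected %d fields, found %d" % (len(FIELD_ORDER), len(values))
        )
    return dict(zip(FIELD_ORDER, (v.strip() for v in values)))
```

A column-count check like this catches dropped tabs, but not two columns swapped by the member, which is why the converted file is still examined by hand.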
1.4) Data Processing
At this point the AMICO editorial and technical staff
work together and with the members to correct as many data problems
and invalid records as possible before the time arrives for the annual
export to distributors.
Editorial tools used in this process include:
- otyNormalize.cgi helps members fix invalid Object Types.
- dateNormalize.cgi helps members index dates.
- fieldFormat.pl removes leading and trailing spaces, tabs, and
single/double quotes, and standardizes capitalization for
a specified data value.
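The kind of cleanup fieldFormat.pl performs can be sketched in a few lines. This is an illustration of the idea rather than the script itself: the case rules offered here are assumptions, since the procedure does not say which capitalization is applied, and field_format is a hypothetical name.

```python
def field_format(value, case="keep"):
    """Trim spaces, tabs and quotes from both ends, then apply a case rule.

    Assumption: the case rules below are illustrative; the procedure does
    not say which capitalization fieldFormat.pl applies.
    """
    cleaned = value.strip(" \t'\"")
    if case == "upper":
        return cleaned.upper()
    if case == "lower":
        return cleaned.lower()
    if case == "first":
        # Capitalize the first character and leave the rest untouched.
        return cleaned[:1].upper() + cleaned[1:]
    return cleaned
```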
AMICO Library records are exported from the CMS to
send to AMICO Distributors and to release via the AMICO Thumbnail
Catalog:
2.1) Image/Multimedia Export
Moving one by one through the dirs in the /amico/scratch
tree for media submitted for the current library year (the dir naming
scheme contains the year, so this is simple), the following script
is run in each dir:
writeTiffHeaders.pl /amico/scratch/MEDIANAME > /amico/data/log/MEMB.writeTiffHeaders.log
This will write AMICO rights and record data to the
header of each tiff file and place the resulting file in /amico/data/totape.
This script also verifies that the file has a metadata record and
is linked to by a valid record. Files that fail this validation test
are moved by the script to /amico/data/unlinked. Files that fail this
procedure due to factors unrelated to the database will be noted in
the log.
Files from /amico/data/totape are written to tape.
This can be done at any point in the process of moving through the
media dirs but should be done before the combined size of the
files in /amico/data/totape exceeds the size of a single tape (less
than 70GB at this time on our DLT7000 drive). A blank tape should
be loaded and labeled at this point and the density set on the tape
drive to ensure maximum size will be used, unless a distributor has
requested a specific tape density.
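To decide when /amico/data/totape must be flushed to tape, file sizes can be added up in order and a new batch started whenever the next file would not fit. The sketch below does this for a list of (name, size) pairs. The capacity is a parameter because the working figure for a tape is not fixed by this procedure, tar overhead is ignored, and plan_tapes is a hypothetical helper.

```python
def plan_tapes(files, capacity_bytes):
    """Group (name, size) pairs, in order, into batches that fit one tape."""
    batches, current, used = [], [], 0
    for name, size in files:
        if size > capacity_bytes:
            raise ValueError(name + " is larger than one tape")
        if used + size > capacity_bytes:
            # The next file would overflow the tape: close this batch.
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches
```

Keeping the files in their original order means each batch corresponds to a contiguous run of media dirs, which keeps the tape log sheets simple.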
The tape write command is:
cd /amico/data/totape
tar cvvf /dev/nst0 * >> /amico/data/log/TAPENAME.log
Once all of the files in the directory have been written
to tape, the files are moved into /amico/data/MEMB, MEMB being the
member prefix for any given file. Overwrites should be confirmed in
this move to allow for replacement images from members. Once a tape
is full, a log sheet should be made for that tape, recording contents,
density, and date completed.
These tapes are shipped to distributors as a set for
a given library year.
2.2) Data Export
In a file, write an SQL command to get a list of AIDs
for the records to be exported. The following example would result
in all valid 2001 records for export:
SELECT aid FROM catalog WHERE aid NOT LIKE 'TEST%'
AND aly = 2001 AND avv > 0 AND aly != 0
AND del NOT LIKE 'Y' ORDER BY aid
With that command placed in a file called AID2001.sql,
run the following command:
sqsh -U amico -i AID2001.sql -o AID2001.txt
AID2001.txt will contain the AIDs for the records
to be exported, and should be edited to remove the extraneous SQL
output at the beginning and end of the file.
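The manual cleanup of AID2001.txt could be scripted along these lines. The sketch assumes the extraneous output consists of a column header, a row of dashes under it, blank lines, and a trailing "(N rows affected)" line, which is typical of SQL shells but has not been checked against actual output here. The function name and the AID values in the usage example are hypothetical.

```python
import re

ROW_COUNT = re.compile(r"\(\d+ rows? affected\)")


def extract_aids(lines):
    """Return the AID values from raw query output, dropping the extras."""
    aids = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank padding
        if text.lower() == "aid":
            continue  # column header
        if set(text) == {"-"}:
            continue  # underline beneath the header
        if ROW_COUNT.fullmatch(text):
            continue  # trailing row count
        aids.append(text)
    return aids
```

For example, extract_aids([" aid ", " ----- ", " LACM.1 ", "(1 row affected)"]) keeps only "LACM.1".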
Next run the following script:
exportFullCatMet.pl AID2001.txt > AMICO.2001.full.txt
AMICO.2001.full.txt will contain the full export in
AMICO data format, containing both data and metadata.
To export to XML, run:
xml.pl <full text datafile> <list of AIDs>.
The script will create an XML directory with a folder
for each member containing all corresponding XML files. Parsing errors
are placed in a file called parsing_errors.txt in the same directory
as the XML folder.
AMICO data exports, including the XML export, are
placed in /home/export and may be downloaded at any time by distributors
by using their individual ftp logins. These distributor accounts are
ftp-only and all point to /home/export as their home.
The exports are named to indicate the date of export
and will all be available indefinitely to allow complete and accurate
reconstruction of the data.
4.1) Backups
Backups of the update.amico.org server are automated
and run weekly. A backup tape should be placed in the
DLT drive for the weekend and rewound. A cron process drops a database
dump in /extra and backs up the following directory trees to tape
late Saturday nights:
/extra
/home
/amico/work-space
/amico/data is backed up once for a set of local
masters following the completion of a year's library export.
4.2) Data Export
On Thursdays a text file containing data changes from
the prior week is prepared for distributors. The following script
is run:
exportChanges.pl MM/DD/YY > /home/export/AMICO.update.YYYYMMDD.txt
where MM/DD/YY is the date when the last update file
was created.
The update file will then be available for all distributors
via ftp on update.amico.org using their respective ftp logins.
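The weekly command uses two different date formats, MM/DD/YY for the argument and YYYYMMDD in the output filename, which is easy to get wrong by hand. The sketch below builds the command line from two dates. It assumes zero-padded fields and that the filename carries the date on which the file is created; the function name is hypothetical.

```python
from datetime import date


def export_changes_command(last_update, today):
    """Return the exportChanges.pl command line as a string."""
    since = last_update.strftime("%m/%d/%y")  # MM/DD/YY of the last update file
    stamp = today.strftime("%Y%m%d")          # YYYYMMDD for the new file name
    return "exportChanges.pl %s > /home/export/AMICO.update.%s.txt" % (since, stamp)
```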
The AMICO public web database (APW) must also be updated
at this point. Tar the thumbnails in /home/httpd/html/amico/thumbnails
and ftp them to search.amico.org. Then extract this tar file to /home/httpd/html/amico/apw/thumbnails
on search.amico.org. Next, run the following set of commands on update.amico.org:
cd /home/export
grep -v DELY AMICO.update.YYYYMMDD.txt > updates.YYYYMMDD.txt
grep DELY AMICO.update.YYYYMMDD.txt > deletions.YYYYMMDD.txt
Ftp to search.amico.org and upload those two files
to /home/updates.
Log in to search.amico.org and run the following commands:
cd /home/httpd/html/amico/apw/search/admin/cmd
importAPWbatch.pl /home/updates/updates.YYYYMMDD.txt
deleteAPWbatch.pl /home/updates/deletions.YYYYMMDD.txt
The APW will now be up to date for the week.
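The two grep commands divide the update file line by line: any line containing DELY goes to the deletions file and every other line goes to the updates file. A Python sketch of the same split follows, which could serve as a starting point if this step is ever automated. The function name is hypothetical, and like grep it assumes that the DELY marker appears on the same line as the record it belongs to.

```python
def split_update(lines, marker="DELY"):
    """Split lines into (updates, deletions) the way grep -v / grep would."""
    updates, deletions = [], []
    for line in lines:
        # grep matches the marker anywhere in the line, so do the same here.
        (deletions if marker in line else updates).append(line)
    return updates, deletions
```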
The CMS needs to be picked apart and checked for accuracy
at all levels given the statistical inconsistencies we've been
seeing lately.
Additional enhancements need to be made to the script
that validates records:
- Date indexing
- Check member prefix
- Deny overwrites for specified protected fields
- Autocorrect duplicate file names, warn of change
- More to come
Automate weekly data exports, including APW.
Not surprisingly, times required for any routine involving
database access are growing noticeably longer as the size of the library
grows. Certain scripts involved in import, search and export would
benefit from a rewrite in C.
The record linking reports are broken in some cases
and, even when working, are unclear as to the nature of a problem
with a given record.
Distribution tape writes should be automated.
Many ADP field entries could probably be removed from
the database, first saved externally of course. This would cut dump
file sizes and improve db performance when editing.
Need to update load scripts/ record editor scripts
to purge redundant entries
There is a definite need for a set of tools for simplifying
certain large-scale direct database changes we commonly see, like
rights updates, preferably accessible through the CMS.
CMS browser issues should be definitively addressed.
An entry should be added to the database to note when
a given file has been sent to a distributor, preferably as part of
the tape writing automation mentioned above.
(Low priority) a Win32 client for record editing would
be very useful.

Last modified on
October 10, 2001