Data Cleanup

This page provides guidelines for cleaning up data, for example when importing a new collection to Arctos or incorporating electronic data into an existing collection.

Nothing here should be taken as step-by-step instructions, but rather as general guidelines. The Arctos DBAs can help with any step of this process, and often have proven tools, scripts, and workflows. The general steps will be:

  1. Create Agents as needed
  2. Create Taxonomy as needed
  3. Enter Accession information into Arctos
  4. Create Geography as needed
  5. Load specimen records
  6. Load anything else, such as Citations, Loans, additional Identifications, etc.

Organizing Data

Denormalized data (e.g., “you and me” and “you” and “me” and “me & you” all in various Agents columns) remains the most manual and time-consuming part of this process. These data will need to be dealt with on a case-by-case basis. Identify such data and formulate a plan to normalize them very early in the data cleanup process. The rest of these instructions assume that this has been dealt with.

First, transform the data into the Arctos Bulkloader format. Note that the bulkloader is datatype-agnostic and will accept (almost) any data; the contents of the data are fairly unimportant at this stage, but the data should be in the correct column.
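As a rough illustration of getting values into the correct columns, the sketch below renames source columns to bulkloader column names without touching the values. The source headers ("Collector", "Region") and the sample row are invented for this example; the actual target column names should be taken from the Bulkloader template.

```python
# Hypothetical mapping: source header -> bulkloader column.
# The source headers on the left are invented for illustration.
COLUMN_MAP = {
    "Collector": "collector_agent_1",
    "Region": "higher_geography",
}

def to_bulkloader_columns(rows, column_map):
    """Rename columns; values are copied unchanged.

    Columns missing from column_map keep their original name so that
    nothing is silently dropped.
    """
    return [
        {column_map.get(name, name): value for name, value in row.items()}
        for row in rows
    ]

source_rows = [
    {"Collector": "Guy, S.", "Region": "North America, United States, Alaska"},
]
bulk_rows = to_bulkloader_columns(source_rows, COLUMN_MAP)
print(bulk_rows)
```

Values are deliberately left alone here, since cleanup happens later through the lookup tables.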

Load the data to Arctos through the specimen bulkloader. Choose the “Push to pre-bulkloader” option.

Follow the instructions. You will eventually end up with a bunch of files containing two columns, the original data and “shouldbe.”

In the spreadsheet or application of your choice, fill in the lookup tables. For example, agents “Some Guy,” “Some R. Guy,” and “Guy, S.” might all be mapped to shouldbe = “Some Random Guy.”
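Before loading a lookup table back, it can help to check which rows are still blank and which original values have been mapped to the same target. The following is a minimal sketch, assuming the lookup file is a CSV with headers "original" and "shouldbe"; the "original" header is an assumption, and the files produced by the pre-bulkloader may label that column differently.

```python
import csv
import io
from collections import defaultdict

# Sample lookup file contents, invented for illustration.
SAMPLE = """original,shouldbe
Some Guy,Some Random Guy
Some R. Guy,Some Random Guy
"Guy, S.",Some Random Guy
S. Guy,
"""

def review_lookup(handle):
    """Return (unfilled originals, originals grouped by shouldbe value)."""
    unfilled = []
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        target = (row["shouldbe"] or "").strip()
        if target:
            groups[target].append(row["original"])
        else:
            unfilled.append(row["original"])
    return unfilled, dict(groups)

unfilled, groups = review_lookup(io.StringIO(SAMPLE))
print("Still unmapped:", unfilled)
for target, originals in groups.items():
    print(target, "<-", originals)
```

For a real file, pass an open file handle (opened with newline="") instead of the io.StringIO sample.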

Use the “shouldbe” columns to create new authority data through Arctos tools. For example, Agent first, middle, and last names may be extracted through the Agent Name Splitter, the results of that may be validated through the Agent Bulkloader, and any new agents may be created. This process is very likely to result in the discovery of potential-duplicate agents, which must be substituted into the “shouldbe” column (or have alternative names created). That is, Arctos “learns” from the introduction of new data, and the results of that process may change the lookup values. This is an iterative process; do not expect the first attempt at cleanup to be the final pass.

Once all lookup tables are completed, load them back to the pre-bulkloader. This will replace, for example, every “Guy, S.” with “Some Random Guy” in ALL agents columns (of which there may be very many).

Use the pre-bulkloader to fill in any missing defaults.

Push to the Bulkloader and load the data.

— the following is pre-pre-bulkloader; the process remains valid, but new tools provide a simplified approach —

Organizing Data

It is generally a good idea to organize specimen data somewhat like the bulkloader as early as possible. If you have an RDBMS with which you are familiar, this may be better done as a last step, but we generally avoid anything except flat files, since relational systems tend in practice to be less than optimally organized. Don’t worry about data quality at this stage; the goal is simply to get the data which will be loaded with specimens into a share-able format.

Agents

Extract distinct values of Agents, regardless of their role. Collectors, preparators, identifiers, etc. should all be merged into a single column in a single table.

Agents are often stored as concatenations, such as “John Smith and Jane Smith.” Arctos agents are data objects, so these must be separated. Add as many columns as necessary (agent_1, agent_2, etc.) and copy/paste/script/magic the agents apart. Order is unimportant: at this stage, “John Smith and Jane Smith” becomes agent_1=John Smith; agent_2=Jane Smith, and “Jane Smith and John Smith” becomes agent_1=Jane Smith; agent_2=John Smith. Do NOT change, standardize, or alter any data at this point, because exact string matches are critical to repatriation.
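One way to script the separation is sketched below. The separator list (" and ", "&", ";") is an assumption about how a collection happens to concatenate names and will need adjusting for real data. Commas are intentionally not treated as separators, because a string such as “Guy, S.” is a single agent.

```python
import re

# Assumed separators: the word "and", an ampersand, or a semicolon.
# Commas are left alone so that "Guy, S." stays one agent.
SEPARATORS = re.compile(r"\s+and\s+|\s*&\s*|\s*;\s*", re.IGNORECASE)

def split_agents(agent_string):
    """Split a concatenated agent string into agent_1, agent_2, ... values.

    The pieces are not standardized in any way; only the separators
    (and the whitespace around them) are removed.
    """
    parts = [p for p in SEPARATORS.split(agent_string) if p.strip()]
    return {f"agent_{n}": part for n, part in enumerate(parts, start=1)}

print(split_agents("John Smith and Jane Smith"))
print(split_agents("Jane Smith & John Smith"))
print(split_agents("Guy, S."))
```

Keep the original concatenated string in its own column alongside the split values, so the results can be joined back to the source data later.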

Merge the agent_n columns and again extract distinct values. Add a second column to this spreadsheet, “agent_should_be.” Go through them carefully, preferably with someone familiar with the collection, standardizing terms into the new column. Do NOT alter the original namestring in any way. If J. Smith, John Smith, and Jonathan Smith are all deemed to be the same person, the final data should look something like this:
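The merge-and-deduplicate step could be scripted roughly as follows. This sketch assumes the split data are held as a list of dictionaries with agent_1, agent_2, etc. keys; the "agent" header for the original string is an invented name. It only produces the empty template, since filling in “agent_should_be” remains a manual, expert task.

```python
import csv
import io

def build_lookup_template(rows, agent_columns):
    """Collect distinct, non-empty agent strings from all agent columns.

    Returns one row per distinct string with an empty agent_should_be
    column. The original strings are not modified.
    """
    distinct = set()
    for row in rows:
        for column in agent_columns:
            value = row.get(column)
            if value:
                distinct.add(value)
    return [{"agent": value, "agent_should_be": ""} for value in sorted(distinct)]

def write_template(template, handle):
    """Write the template as a two-column CSV."""
    writer = csv.DictWriter(handle, fieldnames=["agent", "agent_should_be"])
    writer.writeheader()
    writer.writerows(template)

rows = [
    {"agent_1": "John Smith", "agent_2": "J. Smith"},
    {"agent_1": "Jonathan Smith", "agent_2": ""},
    {"agent_1": "J. Smith"},
]
template = build_lookup_template(rows, ["agent_1", "agent_2"])
buffer = io.StringIO()
write_template(template, buffer)
print(buffer.getvalue())
```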

[Screenshot: spreadsheet with the original agent strings in one column and the standardized name in the “agent_should_be” column]

Extract unique values from the “shouldbe” column. Run them against the Arctos agents, and eliminate any existing agents. You may wish to edit the Arctos records of existing agents – ask a DBA for help if the bulkloader is insufficient. Add necessary columns and load the new agents to Arctos. A template and cleanup tools are available.
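Eliminating agents that already exist amounts to a set difference. The sketch below assumes a list of existing Arctos agent names has already been obtained by some means (how to get that list is not covered here) and compares by exact string only, which is likely cruder than the matching the Arctos tools perform.

```python
def agents_to_create(lookup_rows, existing_agent_names):
    """Return standardized names that are not already known agents.

    lookup_rows: dictionaries with an "agent_should_be" key.
    existing_agent_names: names assumed to be exported from Arctos.
    Comparison is by exact string match.
    """
    wanted = {
        row["agent_should_be"]
        for row in lookup_rows
        if row.get("agent_should_be")
    }
    return sorted(wanted - set(existing_agent_names))

lookup_rows = [
    {"agent": "J. Smith", "agent_should_be": "Jonathan Smith"},
    {"agent": "John Smith", "agent_should_be": "Jonathan Smith"},
    {"agent": "Guy, S.", "agent_should_be": "Some Random Guy"},
]
existing = ["Some Random Guy"]  # sample data, invented for illustration
print(agents_to_create(lookup_rows, existing))
```

Anything this returns is only a candidate for creation; near matches with existing agents still need a human check.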

Add necessary columns to the original data (e.g., collector_agent_1), join to the cleaned-up file, and insert the cleaned-up agent information. We usually leave the original data in the file to serve as a check.
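A sketch of this join is shown below. It adds the cleaned value in a new column, leaves the original column in place as a check, and reports any original string that has no entry in the lookup. The column names agent_1 and agent_2 are assumed from the splitting step, and the data are invented for illustration.

```python
def add_cleaned_agents(rows, lookup, column_pairs):
    """Add cleaned agent columns next to the original ones.

    lookup: original string -> standardized name (exact match).
    column_pairs: original column -> new bulkloader column.
    Returns (new rows, set of original strings missing from the lookup).
    """
    unmatched = set()
    joined = []
    for row in rows:
        new_row = dict(row)  # original columns are kept untouched
        for original_col, cleaned_col in column_pairs.items():
            value = row.get(original_col) or ""
            new_row[cleaned_col] = lookup.get(value, "")
            if value and value not in lookup:
                unmatched.add(value)
        joined.append(new_row)
    return joined, unmatched

lookup = {"J. Smith": "Jonathan Smith", "John Smith": "Jonathan Smith"}
rows = [
    {"agent_1": "J. Smith", "agent_2": "Jane Smith"},
    {"agent_1": "John Smith", "agent_2": ""},
]
pairs = {"agent_1": "collector_agent_1", "agent_2": "collector_agent_2"}
joined, unmatched = add_cleaned_agents(rows, lookup, pairs)
print(joined)
print("Missing from lookup:", unmatched)
```

A non-empty unmatched set usually means an original string was altered somewhere along the way, which is why exact string matches matter.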

Taxonomy

Approach Taxonomy much like Agents: Extract unique values, create new values in Arctos as necessary, add a column for the cleaned-up values, and use a lookup table join to repatriate cleaned and standardized Identifications.  See http://arctosdb.org/how-to/create/bulkloader/#taxa for more information on formatting formulaic Identifications, and talk to the DBA folks about bulkloading new taxon names.

Geography

The first steps are similar to Agents and Taxonomy: Extract unique values from the original data. The Arctos bulkloader works with the concatenated higher_geography string, and it may be necessary to concatenate your data