Issue 14809

Make full width occurrence downloads

14809
Reporter: omeyn
Type: Story
Summary: Make full width occurrence downloads
Priority: Major
Resolution: Fixed
Status: Closed
Created: 2014-01-17 13:08:42.709
Updated: 2014-03-18 09:37:47.919
Resolved: 2014-03-18 09:37:47.89
        
Description: Bare minimum to go live (all of these end conditions must be met on UAT against full-size data):

- every column we provide must be a Term so that it can be declared in meta.xml
- everything goes through the big download path (no Solr->small download checks or path)
- two files in the archive, one for interpreted fields and one for verbatim fields (interpreted is the core and verbatim is an extension), both with rowType occurrence
- one Hive table backed by HBase, created from a script (this script must also generate the magic headers file in HDFS that gets merged into the final file)
- the Hive column names will become the DwC-A header row
- one set of Hive columns will be all verbatim terms (matching the VerbatimOccurrence fields) with a v_ prefix; another set of Hive columns will be the terms of Occurrence, with terms that have been superseded by interpretation removed (in exactly the same way we do it in our API calls), plus the interpreted Java fields on Occurrence
- when we create the two DwC-A files, we strip the v_ prefix from the header row entries while writing the verbatim file (in meta.xml it's the full term name URI)
- all queries to Hive (where clause) go to the interpreted columns (i.e. the ones without the v_ prefix)
- the Hive table that will be queried should live in HDFS (like we do now - recreated every 4? hrs in the Oozie coordinator)
- workflow updated to create the two final tables from an initial single table query result
- copy & zip needs to build the DwC-A using the new meta.xml and 2 occurrence files (core + extension)
- the order of terms/columns in meta.xml must be constant
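
The header handling described in the list above (Hive column names become the header row, with the v_ prefix stripped for the verbatim file) could be sketched roughly as follows. This is only an illustration under assumptions: the function name `split_headers` and the sample column names are hypothetical and are not taken from the actual download workflow.

```python
# Minimal sketch, assuming the Hive column names arrive as an ordered list.
# All names below are hypothetical and used for illustration only.

VERBATIM_PREFIX = "v_"


def split_headers(hive_columns):
    """Split Hive column names into the two header rows.

    Columns without the v_ prefix form the interpreted (core) header.
    Columns with the v_ prefix form the verbatim (extension) header,
    with the prefix stripped. Input order is preserved in both lists,
    so the column order stays constant from run to run.
    """
    interpreted = [c for c in hive_columns if not c.startswith(VERBATIM_PREFIX)]
    verbatim = [
        c[len(VERBATIM_PREFIX):]
        for c in hive_columns
        if c.startswith(VERBATIM_PREFIX)
    ]
    return interpreted, verbatim


# Hypothetical sample of Hive column names.
columns = [
    "gbifid",
    "scientificname",
    "decimallatitude",
    "v_scientificname",
    "v_decimallatitude",
]

interpreted_header, verbatim_header = split_headers(columns)
print("\t".join(interpreted_header))
print("\t".join(verbatim_header))
```

Where clauses would only ever reference names from the interpreted list; the stripped verbatim names are used solely when writing the header row of the verbatim file.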