Issue 18656

OccurrenceHDFSBuild Oozie workflow fails

Reporter: cgendreau
Type: Bug
Summary: OccurrenceHDFSBuild Oozie workflow fails
Priority: Unassessed
Resolution: Fixed
Status: Closed
Created: 2016-07-22 17:05:52.883
Updated: 2016-07-25 14:26:55.774
Resolved: 2016-07-25 14:26:55.586
        
Description: {code}
Error: java.io.IOException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
Mon Jul 18 05:28:16 CEST 2016, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=60303: row 'H�P�' on table 'prod_a_occurrence' at region=prod_a_occurrence,H\x8FP\xC8,1450266127894.9051d9b1d8afe30d6eee274475e36a49., hostname=c4n10.gbif.org,60020,1467034222715, seqNum=6526259527
{code}
    


Author: cgendreau
Created: 2016-07-22 17:11:14.677
Updated: 2016-07-22 17:11:14.677
        
I increased the timeout to 120000 ms (https://github.com/gbif/occurrence/commit/eb04ff6b3e241f8468173e6c4049054480a1dd20) to see if we could get further, but the workflow still fails.

I had a look at:
- hbase hbck -details
- hdfs fsck -list-corruptfileblocks
- logs on c4n10.gbif.org

I found nothing suspicious.

The only hint so far is the region server 'c4n10.gbif.org,60020,1467034222715', which appears in the last 3 attempts:
{code}
region=prod_a_occurrence,)\x80\xB0=,1407854568231.405ad2dbf0655b5625c7f72b54c4e40f., hostname=c4n10.gbif.org,60020,1467034222715, seqNum=5733538129
region=prod_a_occurrence,H\x8FP\xC8,1450266127894.9051d9b1d8afe30d6eee274475e36a49., hostname=c4n10.gbif.org,60020,1467034222715, seqNum=6526259527
region=prod_a_occurrence,H\x8FP\xC8,1450266127894.9051d9b1d8afe30d6eee274475e36a49., hostname=c4n10.gbif.org,60020,1467034222715, seqNum=6526259527
{code}
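
To check whether the failed attempts really cluster on one region server, the retry lines can be tallied by hostname and by region. The following is a minimal sketch, not part of the workflow: the function name `tally_attempts` and both regular expressions are assumptions based only on the line format shown above.

```python
import re
from collections import Counter

# Assumed format, taken from the retry lines above:
#   region=<region name>, hostname=<host>,<port>,<startcode>, seqNum=<n>
SERVER_RE = re.compile(r"hostname=([^,\s]+),(\d+),(\d+)")
REGION_RE = re.compile(r"region=(.*?), hostname=")


def tally_attempts(lines):
    """Count failed attempts per region server host and per region name."""
    by_host = Counter()
    by_region = Counter()
    for line in lines:
        server = SERVER_RE.search(line)
        if server:
            by_host[server.group(1)] += 1
        region = REGION_RE.search(line)
        if region:
            by_region[region.group(1)] += 1
    return by_host, by_region


# The three attempts quoted in this comment.
sample = [
    r"region=prod_a_occurrence,)\x80\xB0=,1407854568231.405ad2dbf0655b5625c7f72b54c4e40f., hostname=c4n10.gbif.org,60020,1467034222715, seqNum=5733538129",
    r"region=prod_a_occurrence,H\x8FP\xC8,1450266127894.9051d9b1d8afe30d6eee274475e36a49., hostname=c4n10.gbif.org,60020,1467034222715, seqNum=6526259527",
    r"region=prod_a_occurrence,H\x8FP\xC8,1450266127894.9051d9b1d8afe30d6eee274475e36a49., hostname=c4n10.gbif.org,60020,1467034222715, seqNum=6526259527",
]

by_host, by_region = tally_attempts(sample)
print(by_host.most_common())  # [('c4n10.gbif.org', 3)]
for region, count in by_region.most_common():
    print(count, region)
```

Running the same tally over all 36 attempts from the task log would show whether other hosts or regions are involved as well.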

    


Author: cgendreau
Comment: The workflow succeeded after running an HBase major compaction.
Created: 2016-07-25 14:26:39.865
Updated: 2016-07-25 14:26:39.865
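
For reference, a major compaction can be requested through the HBase shell with the `major_compact` command. The sketch below is an assumption about how this could be scripted, not a record of what was actually run here: the helper names are hypothetical, and it assumes the `hbase` launcher is on the PATH of the machine where it is executed.

```python
import re
import subprocess

# Conservative check on the table name (optional "namespace:" prefix),
# so that nothing unexpected ends up in the shell script.
TABLE_RE = re.compile(r"^(?:[A-Za-z0-9_]+:)?[A-Za-z0-9_][A-Za-z0-9_.\-]*$")


def build_major_compact_script(table):
    """Return the HBase shell command that requests a major compaction."""
    if not TABLE_RE.match(table):
        raise ValueError("unexpected table name: %r" % table)
    return "major_compact '%s'\n" % table


def request_major_compaction(table):
    """Pipe the command into `hbase shell` (assumes `hbase` is on the PATH)."""
    script = build_major_compact_script(table)
    return subprocess.run(["hbase", "shell"], input=script, text=True, check=True)


print(build_major_compact_script("prod_a_occurrence"), end="")
# major_compact 'prod_a_occurrence'
```

Calling `request_major_compaction("prod_a_occurrence")` would submit the request for the table named in the error above. The shell command only queues the compaction, which then runs asynchronously on the region servers, so the workflow should be retried once it has finished.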