Error > java.io.IOException: Incompatible clusterIDs

The other day, while cleaning up log space on the Hadoop cluster, we realized someone had made the classic mistake of all time... the . in front of the / was missing. Picture it: the /hadoop folder vaporized on the primary NameNode.

This folder is the one that stores the fsimage of the primary NNs; it is nothing more than a directory containing the following files:

  • fsimage > The complete state of the filesystem at a given point in time
  • edit* > The modifications applied on top of that state

For example:

<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_>/hadoop/hdfs/namenode/current$ ll
total 359176
-rw-r--r-- 1 hdfs hadoop 207 Sep 24 13:10 VERSION
drwxr-xr-x 4 hdfs hadoop 4096 Sep 24 18:45 ..
-rw-r--r-- 1 hdfs hadoop 39055540 Oct 7 07:36 edits_0000000000104665286-0000000000104896475
-rw-r--r-- 1 hdfs hadoop 39773003 Oct 7 13:36 edits_0000000000104896476-0000000000105131255
-rw-r--r-- 1 hdfs hadoop 39665351 Oct 7 19:36 edits_0000000000105131256-0000000000105365121
-rw-r--r-- 1 hdfs hadoop 38588743 Oct 8 01:36 edits_0000000000105365122-0000000000105593176
-rw-r--r-- 1 hdfs hadoop 37966508 Oct 8 07:36 edits_0000000000105593177-0000000000105818124
-rw-r--r-- 1 hdfs hadoop 63070696 Oct 8 07:36 fsimage_0000000000105818124
-rw-r--r-- 1 hdfs hadoop 62 Oct 8 07:36 fsimage_0000000000105818124.md5
-rw-r--r-- 1 hdfs hadoop 40066563 Oct 8 13:37 edits_0000000000105818125-0000000000106056329
-rw-r--r-- 1 hdfs hadoop 10 Oct 8 13:37 seen_txid
-rw-r--r-- 1 hdfs hadoop 63247856 Oct 8 13:37 fsimage_0000000000106056329
-rw-r--r-- 1 hdfs hadoop 62 Oct 8 13:37 fsimage_0000000000106056329.md5
drwxr-xr-x 2 hdfs hadoop 4096 Oct 8 13:37 .
-rw-r--r-- 1 hdfs hadoop 6291456 Oct 8 14:28 edits_inprogress_0000000000106056330
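By the way, if you ever need to look inside these files, Hadoop ships offline viewers for both. A minimal sketch, using file names taken from the listing above (point the commands at your own current/ directory):

<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_> hdfs oiv -i fsimage_0000000000106056329 -o /tmp/fsimage.xml -p XML
<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_> hdfs oev -i edits_0000000000105818125-0000000000106056329 -o /tmp/edits.xml

The XML output makes it easy to see exactly which paths and transactions the NameNode knew about at checkpoint time.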

Right after that, the alerts started: jobs could not continue because they did not know where to write in HDFS, reports were not coming out because their queries could not run, the marketing people were jumping out the window because they could not get their last-5-minutes report, etc., etc., etc.

Bottom line: we had to stop the cluster, send out the dreaded notification, and start figuring out how to fix this.

This is more or less what we were seeing in the logs:

2018-09-24 09:22:59,184 WARN  common.Storage (DataStorage.java:loadDataStorage(449)) - Failed to add storage directory [DISK]file:/grid/2/hadoop/hdfs/data/
java.io.IOException: Incompatible clusterIDs in /grid/2/hadoop/hdfs/data: namenode clusterID = CID-b4378b9c-f92a-4249-8161-e83d88737790; datanode clusterID = CID-38ff4b3a-cf46-4d65-860f-43d398eb6798
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:801)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadStorageDirectory(DataStorage.java:322)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadDataStorage(DataStorage.java:438)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:417)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:595)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1543)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1504)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:319)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:272)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:768)
        at java.lang.Thread.run(Thread.java:745)
2018-09-24 09:22:59,188 ERROR datanode.DataNode (BPServiceActor.java:run(780)) - Initialization failed for Block pool <registering> (Datanode Uuid 3d92f67c-46cc-471f-aa66-d717beed08fb) service to hdp-dw-1-nn-1.knockout.local/10.25.12.100:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:596)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1543)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1504)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:319)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:272)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:768)
        at java.lang.Thread.run(Thread.java:745)
2018-09-24 09:22:59,188 WARN  datanode.DataNode (BPServiceActor.java:run(804)) - Ending block pool service for: Block pool <registering> (Datanode Uuid 3d92f67c-46cc-471f-aa66-d717beed08fb) service to hdp-dw-1-nn-1.knockout.local/10.25.12.100:8020
2018-09-24 09:22:59,291 INFO  datanode.DataNode (BlockPoolManager.java:remove(103)) - Removed Block pool <registering> (Datanode Uuid 3d92f67c-46cc-471f-aa66-d717beed08fb)
2018-09-24 09:23:01,292 WARN  datanode.DataNode (DataNode.java:secureMain(2699)) - Exiting Datanode
2018-09-24 09:23:01,296 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 0
2018-09-24 09:23:01,300 INFO  datanode.DataNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:


Let's take this step by step, because the error message can lead us astray: java.io.IOException: Incompatible clusterIDs

Normally this error means the block pool ID has changed and the primary NN no longer recognizes it. That is usually fixed by re-formatting the block pool, and done. Routine task. But that is not the problem here: our NN has lost all its fsimage and edits information and no longer knows what it has and what it does not. As you can see, this is a completely different problem.
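A quick way to tell which of the two situations you are in is to compare the clusterID in the VERSION files on both sides. A sketch using the paths from our cluster (the DataNode host name here is purely illustrative; adjust the paths to your dfs.namenode.name.dir and dfs.datanode.data.dir):

# On the primary NameNode
<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_> grep clusterID /hadoop/hdfs/namenode/current/VERSION
clusterID=CID-b4378b9c-f92a-4249-8161-e83d88737790
# On any DataNode (hypothetical host name)
<<< CORP >>> hdfs@hdp-dw-1-dn-1:$_> grep clusterID /grid/2/hadoop/hdfs/data/current/VERSION
clusterID=CID-38ff4b3a-cf46-4d65-860f-43d398eb6798

If the IDs differ but the NameNode metadata is intact, you have the routine case; if the NameNode's current/ directory is gone, as it was for us, read on.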

If it were the first case, the solution would involve doing something like this:

> <<< CORP >>> hdfs@hdp-dw-1-nn-1:$_> hdfs namenode -format -clusterId CID-[ CID NUMBER ]

But that is not our case. We are left with two options:

  1. Lose all the data and start our data lake from scratch
  2. Recover the fsimage from somewhere

Recovering the fsimage from the secondary NameNode

The first thing is to check where the NameNode will read the checkpoints from; the checkpoints are nothing more than the backups. Check the dfs.namenode.checkpoint.dir property in the HDFS xml and create an empty directory (or empty the existing one):

<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_> mkdir -p /data/secondary_nn/dfs/namesecondary
<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_> chown hdfs:hadoop /data/secondary_nn/dfs/namesecondary
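If you would rather not dig through the XML by hand, hdfs getconf can print the property directly (the value shown here is simply the one from our cluster):

<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_> hdfs getconf -confKey dfs.namenode.checkpoint.dir
/data/secondary_nn/dfs/namesecondary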

From the secondary NN, copy all the contents via scp into the folder set in dfs.namenode.checkpoint.dir:

<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_#> pwd
/data/secondary_nn/dfs/namesecondary
<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_#> scp -r current hdm:/data/secondary_nn/dfs/namesecondary/
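Before importing anything, it is worth checking that the copy arrived intact. The .md5 files written next to each fsimage should be in an md5sum-compatible format, so something along these lines can verify them (the file name is taken from the listing above, purely as an illustration):

<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_#> cd /data/secondary_nn/dfs/namesecondary/current
<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_#> md5sum -c fsimage_0000000000106056329.md5
fsimage_0000000000106056329: OK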

Change the owner and group on the primary NN:

<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_#> chown -R hdfs:hadoop /data/secondary_nn/dfs/namesecondary/*

Run the following command as the hdfs user (we use Hortonworks):

<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_#> hdfs namenode -importCheckpoint

Start the NameNodes and DataNodes again; it should work.
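Once everything is back up, a few standard HDFS sanity checks help confirm the namespace really came back (run them as the hdfs user):

<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_> hdfs dfsadmin -safemode get
Safe mode is OFF
<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_> hdfs dfsadmin -report | head
<<< CORP >>> hdfs@hdp-dw-1-nn-1:$_> hdfs fsck / | tail -n 20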
That is how we fixed a massive outage. It took us 4 hours, but we did not have to rebuild the data lake :)

Good luck!
