Wednesday, October 30, 2013

Apache Hadoop NameNode Manual Recovery Steps

Simulate a NameNode crash, for example by deleting the entire contents of the name directory, then restore the NameNode from the Secondary NameNode. The experiment below was captured step by step.

If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.

The NameNode state consists mainly of two files: fsimage holds the namespace checkpoint, and edits records changes since that checkpoint; together they can be used to restore the NameNode. As the Hadoop documentation puts it:

NameNode persists its namespace using two files:
fsimage, which is the latest checkpoint of the namespace and
edits, a journal (log) of changes to the namespace since the checkpoint.

If the NameNode fails, it can be restored from the Secondary NameNode by the steps below.

Observations:
fs.checkpoint.dir: populated on both the secondary namenode and the namenode
dfs.data.dir: populated on the namenode side, empty on the secondary namenode
dfs.name.dir: stored on the namenode, empty on the secondary namenode
dfs.name.edits.dir: stored on the namenode, empty on the secondary namenode

This shows that the following two parameters belong to the namenode (nn):
dfs.name.dir
dfs.name.edits.dir
and these two belong to the secondary namenode (snn):
fs.checkpoint.edits.dir
fs.checkpoint.dir

The actual test steps are as follows:
hadoop-daemon.sh stop namenode
hadoop-daemon.sh stop secondarynamenode

Simulate total metadata loss:
rm -rf /opt/HDP/meta/*
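Before a destructive step like the rm above, it is prudent to snapshot the metadata directory so the test can be rolled back. A minimal sketch, using throwaway /tmp paths and a stand-in fsimage so it runs anywhere; on the real cluster META would be /opt/HDP/meta:

```shell
#!/bin/sh
# Snapshot a metadata directory before simulating its loss.
# META and BACKUP are demo paths, not the cluster's real ones.
META=/tmp/HDP-meta-demo
BACKUP=/tmp/HDP-meta-backup.tar.gz

mkdir -p "$META/current"
echo "fsimage-bytes" > "$META/current/fsimage"    # stand-in for the real fsimage

tar -czf "$BACKUP" -C "$(dirname "$META")" "$(basename "$META")"
rm -rf "$META"/*                                  # the destructive step being simulated

tar -xzf "$BACKUP" -C "$(dirname "$META")"        # roll back if recovery goes wrong
ls "$META/current"                                # the fsimage is back
```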

Recreate the directories on the namenode:
mkdir -p /home/hadoop/tmp /opt/HDP/meta /opt/HDP/meta/edits /opt/HDP/ckpoint/image/edits

ssh dn1 chmod 755 /opt/HDP/data
ssh dn2 chmod 755 /opt/HDP/data
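With more datanodes, the two ssh/chmod lines above scale to a loop. A sketch, where run_on echoes instead of ssh-ing so it can run without a cluster; on the cluster the function body would be the real ssh call:

```shell
#!/bin/sh
# Apply the same chmod across every datanode.
# run_on is a local stand-in for ssh; the real version would be: ssh "$host" "$@"
run_on() { host="$1"; shift; echo "[$host] $*"; }

for dn in dn1 dn2; do
  run_on "$dn" chmod 755 /opt/HDP/data
done
```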

Copy the checkpoint dir from the snn back to the nn:
scp -pr nn2:/opt/HDP/ckpoint/* /opt/HDP/ckpoint/.
mkdir -p /opt/HDP/ckpoint/image/edits

Temporarily add to $conf/core-site.xml:

<property>
  <name>fs.checkpoint.dir</name>
  <value>/opt/HDP/ckpoint</value>
  <description>Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.</description>
</property>

<property>
  <name>fs.checkpoint.edits.dir</name>
  <value>/opt/HDP/ckpoint/image/edits</value>
</property>


hadoop namenode -importCheckpoint

(-importCheckpoint loads the image from fs.checkpoint.dir and saves it into dfs.name.dir; it will fail if dfs.name.dir already contains a legal image, which is why the directories above were recreated empty.)

13/10/29 22:55:15 INFO ipc.Server: IPC Server listener on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 1 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 2 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 4 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 5 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 6 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 7 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 8 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 3 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 0 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 9 on 8020: starting
13/10/29 23:21:52 INFO logs: Aliases are enabled
13/10/29 23:22:10 ERROR security.UserGroupInformation: PriviledgedActionException as:hadoop cause:org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
The reported blocks is only 0 but the threshold is 0.9500 and the total blocks 3. Safe mode will be turned off automatically.
13/10/29 23:22:10 INFO ipc.Server: IPC Server handler 2 on 8020, call rollEditLog() from 192.168.35.131:45427: error: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
The reported blocks is only 0 but the threshold is 0.9500 and the total blocks 3. Safe mode will be turned off automatically.
[Ctrl + C] to stop the foreground importCheckpoint namenode
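The SafeModeException above is expected: right after the import no datanode blocks have been reported yet, and safe mode lifts once the reported/total ratio passes the 0.95 threshold. Rather than guessing, one can poll for it. A sketch; the hadoop function below is a local stand-in so the loop runs without a cluster (on the cluster, delete it and the real CLI is used):

```shell
#!/bin/sh
# Local stand-in for the hadoop CLI; the real `hadoop dfsadmin -safemode get`
# prints "Safe mode is ON" or "Safe mode is OFF".
hadoop() { echo "Safe mode is OFF"; }

# Poll until the NameNode leaves safe mode.
until hadoop dfsadmin -safemode get | grep -q "OFF"; do
  sleep 5
done
echo "NameNode left safe mode"
```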

hadoop-daemon.sh start namenode

In another session, browse to http://nn1:50070

[hadoop@nn1 meta]$ hadoop-daemon.sh start namenode
[hadoop@nn2 meta]$ hadoop-daemon.sh start secondarynamenode

[hadoop@nn1 meta]$ jps
10104 Jps
9786 NameNode
9964 JobTracker

[hadoop@nn1 meta]$ hadoop dfs -ls testjay
Found 1 items
-rw-r--r--   2 hadoop supergroup         16 2013-10-29 22:14 /user/hadoop/testjay/test128m
[hadoop@nn1 meta]$ hadoop dfs -cat testjay/test128m
12312dsofjosdjf
[hadoop@nn1 meta]$

[hadoop@nn1 meta]$ cd
[hadoop@nn1 ~]$ dd if=/dev/zero of=testfile bs=10240k count=1
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 0.011087 seconds, 946 MB/s
[hadoop@nn1 ~]$ hadoop fs -put testfile testjay/testfile

tail -f hadoop-hadoop-secondarynamenode-nn2.log
eNode: Posted URL nn1:50070putimage=1&port=50090&machine=nn2&token=-41:62854137:0:1383058879000:1383058699786&newChecksum=319867456f25e3b20d3553248b09788b
2013-10-29 23:01:20,205 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://nn1:50070/getimage?putimage=1&port=50090&machine=nn2&token=-41:62854137:0:1383058879000:1383058699786&newChecksum=319867456f25e3b20d3553248b09788b
2013-10-29 23:01:20,262 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done. New Image Size: 989

[hadoop@nn1 ~]$ hadoop fs -rm testjay/testfile
Moved to trash: hdfs://nn1:8020/user/hadoop/testjay/testfile

tail -f hadoop-hadoop-secondarynamenode-nn2.log
2013-10-29 23:04:20,298 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 0 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0
2013-10-29 23:04:20,299 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: closing edit log: position=4, editlog=/opt/HDP/ckpoint/image/edits/current/edits
2013-10-29 23:04:20,299 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: close success: truncate to 4, editlog=/opt/HDP/ckpoint/image/edits/current/edits
2013-10-29 23:04:20,330 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://nn1:50070/getimage?getimage=1
2013-10-29 23:04:20,339 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file fsimage size 989 bytes.
2013-10-29 23:04:20,339 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://nn1:50070/getimage?getedit=1
2013-10-29 23:04:20,344 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file edits size 976 bytes.
2013-10-29 23:04:20,344 INFO org.apache.hadoop.hdfs.util.GSet: Computing capacity for map BlocksMap
2013-10-29 23:04:20,344 INFO org.apache.hadoop.hdfs.util.GSet: VM type       = 64-bit
2013-10-29 23:04:20,345 INFO org.apache.hadoop.hdfs.util.GSet: 2.0% max memory = 1013645312
2013-10-29 23:04:20,345 INFO org.apache.hadoop.hdfs.util.GSet: capacity      = 2^21 = 2097152 entries
2013-10-29 23:04:20,345 INFO org.apache.hadoop.hdfs.util.GSet: recommended=2097152, actual=2097152
2013-10-29 23:04:20,575 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop
2013-10-29 23:04:20,576 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2013-10-29 23:04:20,576 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2013-10-29 23:04:20,577 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.block.invalidate.limit=100
2013-10-29 23:04:20,577 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
2013-10-29 23:04:20,577 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: dfs.namenode.edits.toleration.length = -1
2013-10-29 23:04:20,578 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times
2013-10-29 23:04:20,581 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 11
2013-10-29 23:04:20,594 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 0
2013-10-29 23:04:20,594 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Start loading edits file /opt/HDP/ckpoint/image/edits/current/edits
2013-10-29 23:04:20,606 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: EOF of /opt/HDP/ckpoint/image/edits/current/edits, reached end of edit log Number of transactions found: 10.  Bytes read: 976
2013-10-29 23:04:20,606 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Edits file /opt/HDP/ckpoint/image/edits/current/edits of size 976 edits # 10 loaded in 0 seconds.
2013-10-29 23:04:20,607 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 0 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0
2013-10-29 23:04:20,612 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: closing edit log: position=976, editlog=/opt/HDP/ckpoint/image/edits/current/edits
2013-10-29 23:04:20,612 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: close success: truncate to 976, editlog=/opt/HDP/ckpoint/image/edits/current/edits
2013-10-29 23:04:20,616 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file /opt/HDP/ckpoint/current/fsimage of size 1625 bytes saved in 0 seconds.
2013-10-29 23:04:20,712 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: closing edit log: position=4, editlog=/opt/HDP/ckpoint/image/edits/current/edits
2013-10-29 23:04:20,713 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: close success: truncate to 4, editlog=/opt/HDP/ckpoint/image/edits/current/edits
2013-10-29 23:04:20,728 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL nn1:50070putimage=1&port=50090&machine=nn2&token=-41:62854137:0:1383059060000:1383058880239&newChecksum=25aedb3e4690896fc583ba7bab176cd7
2013-10-29 23:04:20,728 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://nn1:50070/getimage?putimage=1&port=50090&machine=nn2&token=-41:62854137:0:1383059060000:1383058880239&newChecksum=25aedb3e4690896fc583ba7bab176cd7
2013-10-29 23:04:20,782 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done. New Image Size: 1625
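The newChecksum in the putimage URL above appears to be the MD5 of the merged image, so a quick way to confirm both sides hold the same fsimage is to compare checksums. A local sketch with stand-in files; on the cluster the paths would be the nn's and snn's .../current/fsimage:

```shell
#!/bin/sh
# Compare two fsimage copies by MD5 (stand-in files for illustration).
printf 'image-bytes' > /tmp/fsimage.nn
printf 'image-bytes' > /tmp/fsimage.snn

nn_sum=$(md5sum /tmp/fsimage.nn | awk '{print $1}')
snn_sum=$(md5sum /tmp/fsimage.snn | awk '{print $1}')

if [ "$nn_sum" = "$snn_sum" ]; then
  echo "fsimage copies match"
else
  echo "fsimage copies differ"
fi
```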

Takeaways from this recovery: the datanodes did not need to be restarted; only the nn and snn had to be stopped.

Presumably, recovering directly on the snn itself (promoting it) works much the same way, except the following parameters must be changed (the original nn host replaced with the snn's IP/hostname):
fs.default.name
dfs.http.address
dfs.https.address
mapred.job.tracker
mapred.job.tracker.http.address

And point the following parameter at the new snn:
dfs.secondary.http.address
The contents of $conf/masters must also be changed to the new snn.
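The host substitutions above can be scripted with sed. A sketch against a stand-in core-site.xml; OLD_NN, NEW_NN, and the conf path are assumptions for illustration, and the same replace would be repeated over hdfs-site.xml, mapred-site.xml, and masters:

```shell
#!/bin/sh
# Rewrite a conf file to point at the new master (demo copy, not the live conf).
OLD_NN=nn1
NEW_NN=nn2
CONF=/tmp/conf-sketch
mkdir -p "$CONF"

# Minimal stand-in core-site.xml so the rewrite can be demonstrated locally.
cat > "$CONF/core-site.xml" <<EOF
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$OLD_NN:8020</value>
  </property>
</configuration>
EOF

sed -i "s/$OLD_NN/$NEW_NN/g" "$CONF/core-site.xml"   # swap old host for new
grep "$NEW_NN" "$CONF/core-site.xml"
```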
