[Techtalk] Accessing a LUN From Multiple Linux Systems

Joana Botto joana.botto at gmail.com
Sun Jun 14 22:51:05 UTC 2009


On Sun, Jun 14, 2009 at 2:12 AM, Gayathri
Swaminathan<gayathri.swa at gmail.com> wrote:
>
>> And I'm stuck here:
>> http://sources.redhat.com/cluster/wiki/FAQ/Fencing#fence_stuck
>> I'm not sure how to configure fencing methods and devices for a
>> two-node cluster.
>> Would the best way to do it be a qdisk?
>>
>
> Hi there,
>
> Try a crossover cable before resorting to a qdisk, but you can troubleshoot
> this in the following manner:
>
> # group_tool -v; group_tool dump gfs
>
> Check the output in /var/log/messages from when you ran the previous commands.
> That will tell you precisely which node(s) are not yet in the fence domain.
>
> So for troubleshooting's sake, try (1) removing the failing node from
> cluster.conf and seeing if the cluster starts up correctly, or (2) manually
> acknowledging/overriding the fence: fence_ack_manual -n <failing node>
>
> If that goes well, try adding the failing node back into the fence domain
> and giving it a start.
>
> I have assumed all the way through that you are running a RHEL cluster with a
> shared GFS LUN.
>
> --
> Gayathri Swaminathan
> gpgkey: 3EFB3D39
> Volunteer, FDP
>

Hi all,

I haven't had the chance to try the crossover cable yet, because the
servers are in a datacenter and it would be weird for me to go there on
a weekend; I try to look like I have a normal and healthy social life
=)

Each server forms its own cluster with the same name and tries to
fence the other. Meanwhile, I have lost access to one of them during my
remote experiments.
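
To confirm that the two nodes really cannot see each other (I suspect multicast
between them is being filtered), my plan is to run a plain cman_tool nodes on
each box once I get BLALMS08 back, and compare what they report:

[root@BLALMS02 ~]# cman_tool nodes
[root@BLALMS08 ~]# cman_tool nodes

If each node only lists itself as a member, that would match the "two separate
one-node clusters" behaviour I am seeing.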

/var/log/messages repeatedly shows these messages:

Jun 14 23:15:16 BLALMS02 fenced[5479]: fencing node "BLALMS08.EXAMPLE.COM"
Jun 14 23:15:16 BLALMS02 fenced[5479]: fence "BLALMS08.EXAMPLE.COM" failed
Jun 14 23:15:21 BLALMS02 fenced[5479]: fencing node "BLALMS08.EXAMPLE.COM"
Jun 14 23:15:21 BLALMS02 fenced[5479]: fence "BLALMS08.EXAMPLE.COM" failed

I tried to manually fence the node with the two commands below:

[root@BLALMS02 ~]# fence_ack_manual -n BLALMS08
Warning:  If the node "BLALMS08" has not been manually fenced
(i.e. power cycled or disconnected from shared storage devices)
the GFS file system may become corrupted and all its data
unrecoverable!  Please verify that the node shown above has
been reset or disconnected from storage.
Are you certain you want to continue? [yN] y
done

[root@BLALMS02 ~]# fence_node BLALMS08

But /var/log/messages shows this message, and fenced keeps trying to fence the node:

Jun 14 23:15:29 BLALMS02 fence_node[10254]: Fence of "BLALMS08" was unsuccessful

Here are some more outputs:

[root@BLALMS02 ~]# cman_tool status
Version: 6.1.0
Config Version: 10
Cluster Name: blas
Cluster Id: 1525
Cluster Member: Yes
Cluster Generation: 344
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Quorum: 1
Active subsystems: 9
Flags: 2node Dirty
Ports Bound: 0 11 177
Node name: BLALMS02.EXAMPLE.COM
Node ID: 2
Multicast addresses: 239.192.5.250
Node addresses: 172.26.73.189


[root@BLALMS02 ~]# clustat
Cluster Status for blas @ Sun Jun 14 23:14:50 2009
Member Status: Quorate
 Member Name                               ID   Status
 ------ ----                               ---- ------
 BLALMS08.EXAMPLE.COM                           1 Offline
 BLALMS02.EXAMPLE.COM                           2 Online, Local



[root@BLALMS02 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="blas" config_version="10" name="blas">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="BLALMS08.EXAMPLE.COM" nodeid="1" votes="1">
                        <fence>
                                <method name="1"/>
                        </fence>
                </clusternode>
                <clusternode name="BLALMS02.EXAMPLE.COM" nodeid="2" votes="1">
                        <fence>
                                <method name="1"/>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices/>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>
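
Looking at that file again, both <method name="1"/> elements are empty and
<fencedevices/> contains no devices, so I suppose every fence attempt is bound
to fail. Just to check that I understand the syntax, is a manual-fencing setup
supposed to look roughly like the sketch below? (The device name "human" is
just something I made up, and I realise fence_manual is only meant for testing,
not for production.)

        <clusternodes>
                <clusternode name="BLALMS08.EXAMPLE.COM" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="human" nodename="BLALMS08.EXAMPLE.COM"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="BLALMS02.EXAMPLE.COM" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="human" nodename="BLALMS02.EXAMPLE.COM"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_manual" name="human"/>
        </fencedevices>

I assume I would also need to bump config_version and push the new file to both
nodes with ccs_tool update /etc/cluster/cluster.conf, if I understand the docs
correctly.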


[root@BLALMS02 ~]# group_tool -v; group_tool dump gfs
type             level name       id       state node id local_done
fence            0     default    00010002 JOIN_START_WAIT 2 200010001 0
[2]
dlm              1     clvmd      00020002 none
[2]
dlm              1     rgmanager  00030002 none
[2]
1245014652 config_no_withdraw 0
1245014652 config_no_plock 0
1245014652 config_plock_rate_limit 100
1245014652 config_plock_ownership 0
1245014652 config_drop_resources_time 10000
1245014652 config_drop_resources_count 10
1245014652 config_drop_resources_age 10000
1245014652 protocol 1.0.0
1245014652 listen 1
1245014652 cpg 5
1245014652 groupd 6
1245014652 uevent 7
1245014652 plocks 10
1245014652 plock cpg message size: 336 bytes
1245014652 setup done
1245016652 client 6: join /share gfs lock_dlm blas:share rw
/dev/mapper/share-share
1245016652 mount: /share gfs lock_dlm blas:share rw /dev/mapper/share-share
1245016652 share cluster name matches: blas
1245016652 mount: not in default fence domain
1245016652 share do_mount: rv -55
1245016652 client 6 fd 11 dead
1245016652 client 6 fd -1 dead
1245018000 client 6: dump
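
If it helps, the "not in default fence domain" and "do_mount: rv -55" lines
above come from my attempt to mount the GFS volume, which (assuming I am
reading the dump correctly) was essentially:

[root@BLALMS02 ~]# mount -t gfs /dev/mapper/share-share /share

My understanding is that the mount is refused because this node never finishes
joining the fence domain while the fencing of BLALMS08 keeps failing, but
please correct me if that is wrong.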


My problem might be very simple, but my experience with clusters amounts to
less than a week.

Any help will be appreciated.

Thanks,

Joana

