"Ramses" Diskless Cluster Setup

(Note: the following info was adapted in 2003 from the original text by alumnus Essam Metwally).

Before you start, you should have a reasonably strong command of Linux. We take no responsibility for any loss of data, damage, and so on; you get the idea.  The following guide should be fairly complete, but we may have overlooked something or made a typo here or there.  Comments and suggestions are always appreciated.

This work was originally done by our group member Essam Metwally in 2001/2002, after extensive reading and examination of other diskless cluster setups on the net, in particular that of Arthur Weaver of the Cornell SIRIUS: MacChess Cluster.  Although his installation recipe did not work perfectly for us, many of the steps were adapted from his example.  We have simply refined things for our own particular configuration and fixed certain steps.

  1. Install Linux on the server.  We used RedHat 7.2 ISOs downloaded and burned from their site. You may either do a "Custom" install or, as we did, install EVERYTHING.  You never know what you're going to need; besides, disk space is nowhere near as costly as it once was.
  2. After the install, update glibc to whatever the latest version is.  We used 2.4.18, available from Red Hat support, but nowadays there is probably a newer version.  Use the RPMs or grab the source:
        # rpm -Fvh glibc*.rpm
  3. Grab the latest stable build of the kernel (2.4.18 as of this writing).  Once done, you will need to create two custom kernels, one for the server, and one for the diskless clients.  First the server:
    Untar the source to your home directory:
        # tar -zxvf linux-2.4.18.tar.gz
  4. This will both decompress and untar the source to a directory called linux.  The following is optional: you may move the source to the source tree in /usr/src OR continue the steps from your home directory.  It's not really relevant.  We prefer the source tree as it simplifies installation of packages later on, BUT IF YOU MOVE IT TO /usr/src BE SURE NOT TO OVERWRITE THE SOURCE TREE PRESENT.
        # mv linux /usr/src/linux-2.4.18
        # cd /usr/src
    Check whether there is a symbolic link (usually yes) from linux to whatever version of the kernel is currently installed.  If so, remove it and point it at the new source by entering:
        # rm linux
        # ln -s linux-2.4.18 linux
    Time to customize:
        # cd linux
        # make xconfig
    We won't go too much into customization other than some essentials, but customize as you see fit.  Load RamsesServer.conf and alter as necessary.  In particular, check the network device, which is currently set to 3Com Vortex.  Save and Exit.
        # make dep
        # make clean
        # make -j 16
        # make modules
    Assuming no errors,
        # make install
        # make modules_install
    If you are using lilo boot manager, edit /etc/lilo.conf and make sure that it includes the new kernel.  Then run:
        # /sbin/lilo -v
    If you are using grub, edit /boot/grub/grub.conf and add the kernel as appropriate.  Since grub is located in the boot partition, nothing more need be done.  Restart and boot with your freshly created kernel:
        # shutdown -r now
    Create a custom kernel for the diskless nodes.  This is a very stripped-down version because, really, all you have are processor(s), memory, motherboard, and maybe a card or two.  Feel free to use RamsesCluster.conf; again, check that the selections are appropriate for your system, paying particular attention to the network configuration.
        # cd /usr/src/linux (OR WHEREVER YOU LEFT IT)
        # make xconfig
    Load RamsesCluster.conf
    Save and Exit.
        # make clean
        # make dep
        # make -j 16 bzImage
    Make a network-bootable image of the client kernel using the tagging utility mknbi. The utility is available at http://etherboot.sourceforge.net/.  We recommend version 1.0.6. Supposedly, newer versions have an incompatibility issue with a utility we use later on (but we did not experience this, so experiment with it). This supposedly incompatible utility is called imggen and is only necessary for 3COM cards as far as we know (so if you don't use 3COM, don't worry about it and get the latest version of mknbi).
        # rpm -ivh mknbi-1.0.6.noarch.rpm
        # cd /tftpboot
        # mknbi-linux --output=/tftpboot/vmlinuz-2.4.18-cluster \
              --ipaddrs=rom \
              --rootdir=/ \
              --append="ramdisk_size=1024" \
              /usr/src/linux/arch/i386/boot/bzImage
    If you are using a 3COM card, you need to get imggen from the LTSP contributions webpage.  The file is called imggen_v1.01.tgz.
        # tar -zxvf imggen_v1.01.tgz
        # chmod 755 imggen
        # mv imggen /sbin
        # cd /tftpboot
        # /sbin/imggen -a vmlinuz-2.4.18-cluster vmlinuz-2.4.18-cluster-imggen
  5. Set up the Managed Boot Agent for each of the clients.   We were using 3Com Vortex cards so this may or may not apply to you.  We need the MAC addresses to continue with our server setup, so we may as well collect them at the same time.  Configure the clients to use either DHCP or BOOTP.

            Ctrl-Alt-B ( during bootup of client )
            Boot Method: TCP/IP
            Protocol:    DHCP
            Config Message:  Enabled
            Message Timeout:  3 seconds
            Boot Failure Prompt:
            Boot Failure:    Reboot

    Make sure to copy down the MAC Address (XX:XX:XX:XX:XX:XX) for each machine!
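
    Since the DHCP configuration in the next step depends on these addresses being typed correctly, it can help to sanity-check the list before using it. The following is a minimal sketch, assuming you keep the addresses in a plain text file with one "hostname MAC" pair per line. The file layout and the function names are our own invention for illustration, not part of the original setup.

```shell
#!/bin/sh
# Hypothetical helpers for checking a hand-typed list of MAC addresses.

# is_valid_mac MAC: succeed only if MAC has the form XX:XX:XX:XX:XX:XX
# with hexadecimal digits.
is_valid_mac() {
    printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# check_mac_file FILE: FILE holds one "hostname MAC" pair per line.
# Reports malformed and duplicated addresses on stderr and returns
# non-zero if any were found.
check_mac_file() {
    status=0
    while read -r name mac || [ -n "$name" ]; do
        [ -z "$name" ] && continue
        if ! is_valid_mac "$mac"; then
            echo "bad MAC for $name: $mac" >&2
            status=1
        fi
    done < "$1"
    # Compare case-insensitively so 00:AA and 00:aa count as the same card.
    dups=$(awk 'NF >= 2 { print tolower($2) }' "$1" | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "duplicate MAC(s): $dups" >&2
        status=1
    fi
    return $status
}
```

    For example, "check_mac_file macs.txt && echo OK" prints OK only when every entry is well formed and unique (macs.txt is a placeholder name).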

  6. Configure DHCP daemon on the server.  Edit /etc/dhcpd.conf to look like:
       
    default-lease-time 21600;
    max-lease-time 21600;

    option subnet-mask 255.255.255.0;
    option broadcast-address 192.168.0.255;
    option routers 192.168.0.1;
    option root-path "/";
    option domain-name-servers 192.168.0.1;
    option domain-name "";

    shared-network CLUSTER {
        subnet 192.168.0.0 netmask 255.255.255.0 {}
    }

    group {
        use-host-decl-names on;
        option log-servers 192.168.0.1;

        host ramses2 {
            hardware ethernet XX:XX:XX:XX:XX:XX;
            fixed-address 192.168.0.2;
            filename "vmlinuz-2.4.18-cluster-imggen";
        }
        ...
        ...
    }
    Add a machine definition as in ramses2 for each diskless machine in your cluster.

    Restart the DHCP daemon by issuing:
        # /etc/rc.d/init.d/dhcpd restart
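
    Typing one host block per node by hand gets tedious for a large cluster. As a rough sketch, the blocks can be generated from a list of "ID MAC" pairs. This assumes the remaining nodes follow the ramses2 pattern shown above (hostname ramsesN at address 192.168.0.N), which is our assumption; the function name gen_dhcp_hosts is hypothetical.

```shell
#!/bin/sh
# gen_dhcp_hosts: read "ID MAC" pairs on stdin (one per line, '#' starts
# a comment line) and print one dhcpd.conf host block per pair.
# Assumes node N is named ramsesN and gets the address 192.168.0.N.
gen_dhcp_hosts() {
    while read -r id mac || [ -n "$id" ]; do
        case "$id" in
            ''|\#*) continue ;;   # skip blank lines and comments
        esac
        cat <<EOF
    host ramses$id {
        hardware ethernet $mac;
        fixed-address 192.168.0.$id;
        filename "vmlinuz-2.4.18-cluster-imggen";
    }
EOF
    done
}
```

    Run it as "gen_dhcp_hosts < macs.txt" (macs.txt being a placeholder name) and paste the output inside the group { } section of /etc/dhcpd.conf before restarting the daemon. If you did not need imggen, change the filename line to the image you actually created.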
  7. Configure Trivial FTP (tftp) to be operational under xinetd.  Edit /etc/xinetd.d/tftp and set: 
            
             disable = no
             server_args  = -s /tftpboot

    Restart xinetd:
        # /etc/rc.d/init.d/xinetd restart
  8. Download and install ClusterNFS from http://clusternfs.sourceforge.net
        # tar -xvf ClusterNFS-3.0-rc1.tar
        # cd ClusterNFS-3.0-rc1
        # ./BUILD
        # make -j 16 install
  9. Configure NFS to use ClusterNFS instead of the stock nfs daemon.  Edit /etc/init.d/nfs:
            Comment out the line:
                RPCNFSDCOUNT=XX
            to
                #RPCNFSDCOUNT=XX

            change all instances of:
                daemon rpc.mountd $RPCMOUNTDOPTS
                daemon rpc.nfsd $RPCNFSDCOUNT
            to
                # daemon rpc.mountd $RPCMOUNTDOPTS
                /usr/sbin/rpc.mountd $RPCMOUNTDOPTS
           
                # daemon rpc.nfsd $RPCNFSDCOUNT
                /usr/sbin/rpc.nfsd -l --translate-names $RPCNFSDCOUNT

    There should be two occurrences of the mountd daemon (one in start and one in restart) and one occurrence of nfsd (in start).

  10. Modify /etc/exports:        
        # echo "/ 192.168.0.0/255.255.255.0(rw,no_root_squash)" >> /etc/exports
        # echo "/tftpboot/ 192.168.0.0/255.255.255.0(rw,no_root_squash)" >> /etc/exports
    The ">>" operator appends to the file, creating it first if it doesn't exist.
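
    Because these commands append, running them a second time (for instance when redoing the setup) leaves duplicate lines in /etc/exports. One way around that is a small wrapper that only appends a line when it is not already present. This is a sketch of ours; add_export is a hypothetical name, not an existing utility.

```shell
#!/bin/sh
# add_export FILE LINE: append LINE to FILE unless an identical line is
# already there. Creates FILE if it does not exist.
add_export() {
    file=$1
    line=$2
    touch "$file" || return 1
    # -F: fixed string, -x: match the whole line, -q: quiet
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

    Usage would be, for example, add_export /etc/exports "/ 192.168.0.0/255.255.255.0(rw,no_root_squash)", and likewise for the /tftpboot/ line.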

  11. Edit the following hosts files as appropriate for your network:
                /etc/hosts
                /etc/hosts.deny
                /etc/hosts.allow
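
    For /etc/hosts, every client needs a name-to-address entry. A minimal sketch for generating them follows, assuming the nodes are named ramsesN at 192.168.0.N as in the DHCP example (only ramses2 appears in the original, so the pattern for the other nodes is our assumption, and gen_hosts is a hypothetical helper).

```shell
#!/bin/sh
# gen_hosts FIRST LAST: print /etc/hosts lines for client IDs FIRST..LAST,
# assuming node N is called ramsesN and lives at 192.168.0.N.
gen_hosts() {
    first=$1
    last=$2
    i=$first
    while [ "$i" -le "$last" ]; do
        printf '192.168.0.%s\tramses%s\n' "$i" "$i"
        i=$((i + 1))
    done
}
```

    Review the output of, say, "gen_hosts 2 16" before appending it to /etc/hosts. The contents of hosts.allow and hosts.deny depend on your access policy and are not covered by this sketch.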

  12. Using /sbin/chkconfig issue:
        # /sbin/chkconfig --level 2345 atd off
        # /sbin/chkconfig --level 2345 autofs off
        # /sbin/chkconfig --level 2345 apmd off
        # /sbin/chkconfig --level 2345 ipchains off
        # /sbin/chkconfig --level 2345 sendmail off
        # /sbin/chkconfig --level 2345 linuxconf on
        # /sbin/chkconfig --level 2345 dhcpd on
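
    The same commands can be wrapped in a loop, which makes it easier to adjust the service list for your own installation. The sketch below is ours: set_services and the DRY_RUN variable are hypothetical names, and with DRY_RUN=1 the function only prints what it would run.

```shell
#!/bin/sh
# set_services STATE SERVICE...: switch each SERVICE on or off in
# runlevels 2-5. With DRY_RUN=1 the commands are printed, not executed.
set_services() {
    state=$1
    shift
    for svc in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "/sbin/chkconfig --level 2345 $svc $state"
        else
            /sbin/chkconfig --level 2345 "$svc" "$state"
        fi
    done
}
```

    The list above then becomes "set_services off atd autofs apmd ipchains sendmail" followed by "set_services on linuxconf dhcpd".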
  13. Modify the /etc/rc.sysinit or just overwrite it with this one (we recommend backing up your original, and we take no responsibility if your system doesn't reboot; do a line by line comparison modifying as necessary.)

  14. While we're at it, we will temporarily modify /etc/init.d/network
          Comment out the line:
                 touch /var/lock/subsys/network

    We will undo this at the end.  What it does is prevent the creation of a network lock file.  On any shutdown or reboot of the diskless clients, premature termination of the network subsystem causes the machines to hang.  Realistically, since everything lives on the server anyway, the network should never be shut down on a client.

  15. All that's left, unless you want to install programs like Netperf or LMSensors, is to set up the client directories.  Run the script make_clusternfs_client once for each client node:
        # make_clusternfs_client << CLIENTID HERE 2 through 254 >> 
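
    With more than a handful of nodes, a loop saves typing. The following is a sketch under the assumption that make_clusternfs_client is on your PATH and takes the client ID as its only argument, as in the command above. build_clients and DRY_RUN are hypothetical names of ours.

```shell
#!/bin/sh
# build_clients FIRST LAST: run make_clusternfs_client for every client
# ID from FIRST to LAST. IDs must lie in the range 2 through 254.
# With DRY_RUN=1 the commands are printed instead of executed.
build_clients() {
    first=$1
    last=$2
    if [ "$first" -lt 2 ] || [ "$last" -gt 254 ] || [ "$first" -gt "$last" ]; then
        echo "client IDs must satisfy 2 <= FIRST <= LAST <= 254" >&2
        return 1
    fi
    id=$first
    while [ "$id" -le "$last" ]; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "make_clusternfs_client $id"
        else
            # Stop at the first failure so a broken node is noticed.
            make_clusternfs_client "$id" || return 1
        fi
        id=$((id + 1))
    done
}
```

    Try "DRY_RUN=1; build_clients 2 16" first to confirm the range, then unset DRY_RUN and run it for real.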
  16. We'll want to reverse the change we made to /etc/init.d/network by uncommenting the following line:

               touch /var/lock/subsys/network

Steps 14 through 16 can be repeated ad nauseam.  If a client stops working, just rebuild it.  It takes a minute to rebuild rather than hours to track down the problem.  Step 16 is optional.  We have never seen a problem resulting from the failure of the network interfaces to shut down.  If you skip the last step then you don't have to worry (obviously) about step 14 when you want to rebuild later.

Hopefully this helps someone else out there.  We would again like to thank Arthur Weaver for his invaluable assistance.
