CHAPTER 5: Troubleshooting Guide

The following is a (surely incomplete) list of points to check when troubleshooting CPSR2. It covers the problems we have encountered and the procedures we have developed to fix them, and it will grow over time. If you encounter a scenario that is not described here, please report it to the authors.

Current Hints -- UPDATE: 9/11/2004

The most common problem with CPSR2 recently has been GUIManager (green) freezing.

To solve this one:

  1. Kill the GUIManager
  2. On pegasus, open a terminal window, make sure you have an agent running, then run:
    pegasus%  loopssh cpsr 1 30 "killall -9 dm_daemon;dm_daemon"
    
  3. Then, on *any* cpsr machine, type:

    GUIManager
    

That should fix it. At present the freeze recurs every 5-6 hours.

If the GUIMonitor (blue) freezes, it is usually because a machine or an ekg_daemon has died.

To try to rescue this:

  1. Kill the GUIMonitor
  2. On pegasus:
    pegasus% loopssh cpsr 1 30 "killall -9 ekg_daemon;ekg_daemon"
    
  3. Restart the GUIMonitor on a cpsr node.
  4. Hit Master Reset, then restart the 20cm_cpsr2_das and ObsMonitor programs.

If a machine has crashed, the loopssh will freeze on that node, which should be obvious. If this happens, go upstairs and reboot the machine.

Once it has rebooted, go back to step 2.

NOTE: After a reboot you will also need to restart the dm_daemon on that node.
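
Rather than waiting for loopssh to hang, the dead node can be looked for first. The following is a minimal sketch, not part of the CPSR2 software: `probe_nodes` is a hypothetical helper, and it assumes the nodes are named by prefix plus number (cpsr1 to cpsr30), which is inferred from the `loopssh cpsr 1 30` arguments above. It relies only on the standard OpenSSH options `BatchMode` and `ConnectTimeout`.

```shell
# Sketch only: report cpsr nodes that do not answer ssh within a few seconds.
# ASSUMPTION: nodes are named <prefix><n>, e.g. cpsr1 .. cpsr30.

# probe_nodes PREFIX FIRST LAST
# Prints one line per node: "<host> ok" or "<host> UNREACHABLE".
probe_nodes() (
    prefix=$1
    i=$2
    last=$3
    while [ "$i" -le "$last" ]; do
        host="${prefix}${i}"
        # -n: do not read stdin.
        # BatchMode=yes: fail instead of prompting for a password.
        # ConnectTimeout=5: give up after 5 s if the connection cannot be made.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1; then
            echo "$host ok"
        else
            echo "$host UNREACHABLE"
        fi
        i=$((i + 1))
    done
)
```

Running `probe_nodes cpsr 1 30` on pegasus would list every node with its state. The timeout only covers connection setup, so a node that accepts the connection and then hangs could still stall the probe.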

Another "feature" is that TCS sometimes cannot start the observation. If this occurs, try the following:

  1. Quit the GUIMonitor
  2. Kill the 20cm_cpsr2_das processes (Ctrl-C each one)
  3. On pegasus:
    pegasus% killall -9 cpsr2d
    pegasus% cpsr2d
    
  4. Restart the GUIMonitor
  5. Hit Master Reset and wait until all slaves are back online
  6. Restart the 20cm_cpsr2_das processes on cpsr1 and cpsr2
  7. Hit the "connect" button on the main control panel (GUIMonitor)
  8. Reset all the parameters on the main control panel
  9. De-select and re-select CPSR2 in TCS
  10. Try to start an observation again in TCS

Note that killing cpsr2d will reset all the cpsr2 header parameters to their defaults, which may include turning off level setting. Once you have the das programs started, hit the "connect" button on the GUI front panel and go through all the entry boxes, ensuring that everything is set correctly by typing in the value and hitting "enter".

After restarting cpsr2d, be sure to de-select and re-select CPSR2 in TCS; this re-establishes the network control connection. In fact, you might want to try this before restarting cpsr2d, to make sure the problem isn't on the TCS end.

Archived Hints

In normal circumstances, most problems can be fixed using the "Master Reset" button on the manual control panel. This will kill all the CPSR2 related software on the cluster and re-start it, restoring the instrument to a functional state. It will also kill the ObsMonitor and das scripts (but not the monitoring GUIs); these will have to be re-started manually. If "Master Reset" fails to work, read further...

Frozen GUIMonitor or GUIManager

  • The most likely cause of a frozen GUI is a machine crash or daemon hang. We are working on ways to better detect this situation, but for now the best thing to do is loopssh around the cluster (see chapter 2) and kill the dm_daemons if the GUIManager has frozen, or the ekg_daemons if the GUIMonitor has frozen. If the loopssh gets stuck on a particular machine, chances are that this machine is responsible and will have to be re-booted.

One of the primary nodes fails to initialize

  • All the system parameter changes required to run the CPSR2 software are done at boot time, so in theory there should rarely be any trouble initializing the nodes anymore. The most likely explanation is that the connection between the GUI and the ekg_daemon on the machine has broken. Try restarting the daemon.

One of the secondary nodes refuses to come online

  • Same advice as for the primary node failure to initialize.

20cm_cpsr2_das reports an error at runtime

  • Provided the primary nodes have been initialized properly, this should not happen. Even if some of the machines have lost their gigabit connection, 20cm_cpsr2_das will time out after a few seconds, mark the node bad and continue with the startup sequence. The best thing to do in this situation is simply re-initialize the primary nodes.

20cm_cpsr2_das ignores the "GO" signal

  • cpsr2d is probably in a strange state. Follow the cpsr2d restart instructions in the section directly below.
  • If restarting cpsr2d fails to solve the problem, the FFD might be stuck. Try the test procedures detailed in the FFD/EDT section below. If the FFD refuses to generate data, power cycle it.

20cm_cpsr2_das crashes when it receives the "GO" signal

  • You have probably encountered the infamous UTC start bug, or one of its subtle variations. This can be confirmed by reading the error message in the 20cm_cpsr2_das terminals. If it says something like "UTC not found in header", follow these steps to fix things:
    • Hit the "Abort" button on the GUIMonitor manual control page.
    • Quit the GUIMonitor.
    • Run the command killall -9 cpsr2d on pegasus. Then run cpsr2d &.
    • Restart the GUIMonitor.
    • Hit Master Reset and restart the 20cm_cpsr2_das processes.
    • Try to observe. It should work first time.
  • If the error message mentions Yamasaki test failures, follow the procedures outlined below for verifying the integrity of the FFD/EDT system.

Memory buffers overflow on a primary node

  • This is usually symptomatic of another problem somewhere else in the system. Here are the steps to follow:
    • Hit the "Abort" button on the GUIMonitor manual control page.
    • Find out which secondary node the primary was sending to at the time of the overflow. This information can be found on the GUIMonitor primary node panel.
    • Go to the GUIMonitor closeup page for this secondary node.
    • If one or both of the cpsr2_recv and cpsr2_dbdisk processes have crashed, hit the green "Bring Online" button.
    • If everything on the secondary node looks good, you might just be experiencing a slow network. Try again and see how things go.
    • Now go back to the primary node page and initialise both the primary nodes. You will have to restart the 20cm_cpsr2_das processes as well.
  • Be aware that a buffer overflow and subsequent abort can leave cpsr2d in a bad state. If you have trouble restarting the observation, see the previous three troubleshooting sections.
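
The check on the secondary node can also be approximated from a terminal if the GUIMonitor closeup page is unavailable. This is a rough sketch only: `check_secondary` is a hypothetical helper, it assumes `pgrep` is installed on the nodes, and it assumes the two processes run under exactly the names given above.

```shell
# Sketch only: check whether cpsr2_recv and cpsr2_dbdisk are alive on a node.
# ASSUMPTION: pgrep exists on the node and the process names match the text.

# check_secondary HOST
# Prints "<process> running" or "<process> MISSING" for each process.
# Exit status is 1 if either process is missing, 0 otherwise.
# An unreachable node is also reported as MISSING for both processes.
check_secondary() (
    host=$1
    status=0
    for proc in cpsr2_recv cpsr2_dbdisk; do
        # pgrep -x matches the exact process name; exit status 0 means found.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" pgrep -x "$proc" >/dev/null 2>&1; then
            echo "$proc running"
        else
            echo "$proc MISSING"
            status=1
        fi
    done
    exit "$status"
)
```

For example, `check_secondary cpsr5` (the host name here is only a placeholder) would report the state of both processes on that node. A MISSING line corresponds to the case where the "Bring Online" button is needed.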

The folded profiles look strange

  • Make sure the cables are all plugged into the "engineering test rack" upstairs in the correct order. If the bands and/or polarisations have been swapped, the signal will be de-dispersed to the wrong frequency. This should not be a problem unless the cables have been disconnected and reconnected for some reason.

The pulse arrival times look strange

  • Check that the primary nodes' internal clocks are reporting the correct time. If they are not, follow the directions in chapter 2.
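
A coarse first check of a node's clock can be made from pegasus before turning to chapter 2. The sketch below is an illustration only: `clock_offset` is a hypothetical helper, it assumes the clock on pegasus is itself correct, and its one-second resolution (plus ssh latency) means it can only reveal gross errors, not the small offsets that matter for timing.

```shell
# Sketch only: print the difference in whole seconds between a node's clock
# and the local clock. ASSUMPTION: the local clock is correct.

# clock_offset HOST
# Prints (remote time - local time) in seconds; exits non-zero if the node
# cannot be reached.
clock_offset() (
    host=$1
    # date -u +%s prints seconds since the Unix epoch in UTC.
    remote=$(ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" date -u +%s) || exit 1
    now=$(date -u +%s)
    echo $((remote - now))
)
```

For example, `clock_offset cpsr1` should print a value within a second or so of zero on a healthy node; anything larger suggests following the directions in chapter 2.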

You suspect the FFD/EDT system is not working properly

  • To check the integrity of data leaving the EDT cards, run /psr/cvshome/sba/yama_read and use /psr/cvshome/sba/ffd2_serial to start and stop the digitiser. Hit the primary node panel reset button for cpsr2 before attempting this to make sure nothing else is trying to use the serial port.
  • If the EDT cards get into a strange state, you can reload their configuration settings with /opt/EDTpcd/pcdload which MUST be run from the /opt/EDTpcd/ directory.
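
Because pcdload must be started from its own directory, it is easy to forget the `cd` or to leave the terminal in the wrong place afterwards. One way to avoid both is a small wrapper that changes directory inside a subshell. This is a sketch; `run_in_dir` is a hypothetical helper and not part of the EDT or CPSR2 software.

```shell
# Sketch only: run a command from a given directory without changing the
# caller's working directory.

# run_in_dir DIR COMMAND [ARGS...]
run_in_dir() (
    # The function body is a subshell, so this cd does not affect the caller.
    cd "$1" || exit 1
    shift
    "$@"
)
```

With this defined, `run_in_dir /opt/EDTpcd ./pcdload` runs pcdload from the required directory and leaves the current shell where it was.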

The primary nodes refuse to send data to one of the secondaries

  • The data acquisition code reads a text file to decide which nodes it is going to attempt to connect to. This list lives in /home/cpsr/20cmlist so make sure it is up-to-date! A previous observer may have deleted one of the machines from this file for some reason.
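
A quick way to see whether a machine has been dropped from the list is to compare the file against the nodes you expect. The sketch below assumes the file holds one host name per line with nothing else on the line; the actual format of /home/cpsr/20cmlist is not described here, so check it before relying on this. `missing_from_list` is a hypothetical helper.

```shell
# Sketch only: print the hosts that do not appear in a node list file.
# ASSUMPTION: the list file contains one host name per line.

# missing_from_list LISTFILE HOST...
# Prints each HOST that has no matching line in LISTFILE.
# Exit status is 1 if any host is missing, 0 otherwise.
missing_from_list() (
    list=$1
    shift
    status=0
    for host in "$@"; do
        # -q: no output, -x: match the whole line, -F: fixed string (no regex).
        if ! grep -qxF -e "$host" "$list"; then
            echo "$host"
            status=1
        fi
    done
    exit "$status"
)
```

For example, `missing_from_list /home/cpsr/20cmlist cpsr3 cpsr4 cpsr5` (host names chosen for illustration) prints whichever of those names are absent from the file.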

