- 2.1.1. The eXpressWare drivers (IRM etc.) fail to load after they have been installed for the first time.
- 2.1.2. Although the Network Manager is running on the Cluster Management Node, and all Cluster Nodes run the Node Manager, configuration changes are not applied to the adapters. I.e., the NodeId is not changed according to what is specified in /etc/dis/dishosts.conf on the Cluster Management Node.
- 2.1.3. The Network Manager on the Cluster Management Node refuses to start.
- 2.1.4. After a Cluster Node has booted, or after I restarted the Dolphin drivers on a Cluster Node, the first connection to a remote Cluster Node using SuperSockets delivers only Ethernet performance. Retrying the connection then delivers the expected SuperSockets performance. Why does this happen?
- 2.1.5. Socket benchmarks show that SuperSockets are not active as the minimal latency is much more than 10us.
- 2.1.6. I am running a mixed 32/64-bit platform, and while the benchmarks latency_bench and sockperf from the DIS installation show good performance of SuperSockets, other applications show only Ethernet performance for socket communication.
- 2.1.7. I have added the statement export LD_PRELOAD=libksupersockets.so to my shell profile to enable the use of SuperSockets. This works well on some machines, but on other machines, I get the error message ERROR: ld.so object 'libksupersockets.so' from LD_PRELOAD cannot be preloaded : ignore whenever I log in. How can this be fixed?
- 2.1.8. How can I permanently enable the use of SuperSockets for a user?
- 2.1.9. I cannot build SISCI applications that are able to run on my cluster because the Cluster Management Node (where the SISCI-devel package was installed by the SIA) is a 32-bit machine, while my Cluster Nodes are 64-bit machines (or vice versa). I fail to build the SISCI applications on the Cluster Nodes as the SISCI header files are missing. How can this deadlock be solved?
|
2.1.1. | The eXpressWare drivers (IRM etc.) fail to load after they have been installed for the first time. |
| Please follow the procedure below to determine the cause of the problem. Verify that the adapter card has been recognized by the machine. This can be done as follows:
[root@n1 ~]# lspci -v | grep Dolphin
Subsystem: Dolphin Interconnect Solutions AS Device 0810
If this command does not produce output similar to the example above, the adapter card has not been recognized. Please try to power-cycle the system according to FAQ Q: 1.1.1. If this does not solve the issue, a hardware failure is possible; please contact Dolphin support in that case.
Check the kernel message log for relevant messages. This can be done as follows:
# dmesg | grep PXH
# dmesg | grep PVH
Depending on which messages you see, proceed as described below:
- PXH Driver : Preallocation failed
The driver failed to preallocate the memory that is used to export memory to remote Cluster Nodes. Rebooting the Cluster Node is the simplest way to defragment the physical memory space. If this is not possible, or if the message appears even after a reboot, you need to adapt the preallocation settings (see Section 3.2, “dis_irm.conf”).
- PXH Driver: Out of vmalloc space
See FAQ Q: 1.1.2.
If the driver still fails to load, please contact support and provide the driver's kernel log messages:
# dmesg > /tmp/syslog_messages.txt
|
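The log check described in 2.1.1 can be sketched as a small shell helper. This is only an illustration: the function name `classify_pxh_log` and its output tokens are made up for this example, and the match patterns are taken from the two messages quoted above, so other driver messages are not covered.

```shell
# classify_pxh_log (hypothetical helper): reads kernel log text on stdin
# and prints one token per known driver message found.
classify_pxh_log() (
    log=$(cat)
    found=0
    # Message text as quoted in the FAQ; spacing around ':' may vary.
    if printf '%s\n' "$log" | grep -Eq 'PXH Driver *: *Preallocation failed'; then
        echo "preallocation-failed"   # reboot, or adapt dis_irm.conf
        found=1
    fi
    if printf '%s\n' "$log" | grep -Eq 'PXH Driver *: *Out of vmalloc space'; then
        echo "out-of-vmalloc"         # see FAQ Q: 1.1.2
        found=1
    fi
    [ "$found" -eq 1 ] || echo "no-known-message"
)
```

On a Cluster Node it would typically be fed from the kernel log, for example `dmesg | classify_pxh_log`.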
2.1.2. | Although the Network Manager is running on the Cluster Management Node, and all Cluster Nodes run the Node Manager, configuration changes are not applied to the adapters. I.e., the NodeId is not changed according to what is specified in /etc/dis/dishosts.conf on the Cluster Management Node. |
| The adapters in a Cluster Node can only be re-configured when they are not in use. This means that no adapter resources may be allocated via the dis_irm kernel module. To achieve this, make sure that upper-layer services that use dis_irm (like dis_sisci and dis_supersockets) are stopped. On most Linux installations, this can be achieved as follows (dis_services is a convenience script that comes with the Dolphin software stack):
# dis_services stop
...
# service dis_irm start
...
# service dis_nodemgr start
... |
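The stop/start sequence from 2.1.2 can be wrapped in a function so that it stops at the first failing step. This is a minimal sketch: the function name `restart_dolphin_stack` and the `RUN` dry-run variable are assumptions made for this example, while the three commands are the ones shown above.

```shell
# restart_dolphin_stack (hypothetical helper): runs the three commands from
# the FAQ in order and aborts on the first failure.
# Set RUN=echo to print the commands instead of executing them.
restart_dolphin_stack() (
    run=${RUN:-}
    $run dis_services stop &&
    $run service dis_irm start &&
    $run service dis_nodemgr start
)
```

Calling `RUN=echo restart_dolphin_stack` first shows what would be executed without touching the drivers.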
2.1.3. | The Network Manager on the Cluster Management Node refuses to start. |
| In most cases, the interconnect configuration /etc/dis/dishosts.conf is corrupted. This can be verified with the command testdishosts, which reports problems in this configuration file, as in the example below:
# testdishosts
socket member node-1_0 does not represent a physical adapter in dishosts.conf
DISHOSTS: signed32 dishostsAdapternameExists() failed
In this case, the adapter name in the socket definition was misspelled. If testdishosts reports a problem, you can either fix /etc/dis/dishosts.conf manually or re-create it with the Dolphin Network Configurator GUI, dis_netconfig, or its command-line version, dis_mkconf. If this does not solve the problem, please check /var/log/dis_networkmgr.log for error messages. If you cannot fix the problem reported in this log file, please contact Dolphin support and provide the content of the log file. |
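To pull the relevant lines out of the Network Manager log mentioned in 2.1.3, something like the following may help. It is a sketch under assumptions: the helper name `networkmgr_errors` is invented here, and because the log format is not described in this FAQ, the filter simply matches the words "error" and "fail" case-insensitively.

```shell
# networkmgr_errors (hypothetical helper): prints the last N lines of the
# Network Manager log that mention "error" or "fail".
#   $1  log file (default: /var/log/dis_networkmgr.log)
#   $2  maximum number of lines to print (default: 20)
networkmgr_errors() (
    log=${1:-/var/log/dis_networkmgr.log}
    max=${2:-20}
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 2; }
    grep -iE 'error|fail' "$log" | tail -n "$max"
)
```

The output can be attached to a support request together with the full log file.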
2.1.4. | After a Cluster Node has booted, or after I restarted the Dolphin drivers on a Cluster Node, the first connection to a remote Cluster Node using SuperSockets delivers only Ethernet performance. Retrying the connection then delivers the expected SuperSockets performance. Why does this happen? |
| Make sure that the Node Manager runs on all Cluster Nodes of the cluster, and that the Network Manager on the Cluster Management Node is correctly set up to include all Cluster Nodes in the configuration (/etc/dis/dishosts.conf). The option Automatic Create Session must be enabled. This ensures that the low-level "sessions" (Dolphin-internal) are set up between all Cluster Nodes of the cluster, so that a SuperSockets connection succeeds immediately. Otherwise, the sessions are not set up until the first connection between two Cluster Nodes is attempted, which is too late for that first connection to be established via SuperSockets. |
2.1.5. | Socket benchmarks show that SuperSockets are not active as the minimal latency is much more than 10us. |
| The half round-trip latency (ping-pong latency) with SuperSockets typically starts between 3 and 4us for very small messages. Any value above 7us for the minimal latency indicates a problem with the SuperSockets configuration, the benchmark methodology, or something else. Please proceed as follows to determine the reason:
- Is the SuperSockets service running on both Cluster Nodes? /etc/init.d/dis_supersockets status should report the status running. If the status is stopped, try to start the SuperSockets service with /etc/init.d/dis_supersockets start.
- Is LD_PRELOAD=libksupersockets.so set on both Cluster Nodes? You can check using the ldd command. Assuming the benchmark you want to run is named sockperf, run ldd sockperf. libksupersockets.so should appear at the very top of the listing.
- Are SuperSockets configured for the interface you are using? This is a possible problem if your Cluster Nodes have multiple Ethernet interfaces with a different hostname for each interface. SuperSockets may be configured to accelerate only some of the available interfaces. To verify this, check which IP addresses (or subnet masks) are accelerated by SuperSockets by looking at /proc/net/af_ssocks/socket_maps (Linux), and use those IP addresses (or related hostnames) that are listed in this file.
- If the SuperSockets service refuses to start, or only starts into the mode running, but not configured, you probably have a corrupted configuration file /etc/dis/dishosts.conf: verify that this file is identical to the same file on the Cluster Management Node. If not, make sure that the Network Manager is running on the Cluster Management Node (/etc/init.d/dis_networkmgr start). If the dishosts.conf files are identical on the Cluster Management Node and the Cluster Node, they could still be corrupted. In that case, run dis_netconfig or dis_mkconf on the Cluster Management Node to have it load /etc/dis/dishosts.conf, then save it again (dis_mkconf and dis_netconfig will always create syntactically correct files).
- Please check the system log using the dmesg command. Any output there from either dis_ssocks or af_ssocks should be noted and reported to <pci-support@dolphinics.com>.
|
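Two of the checks from 2.1.5 (the preload setting and the accelerated addresses) can be scripted roughly as follows. Both helper names are invented for this sketch. The layout of /proc/net/af_ssocks/socket_maps is not described in this FAQ, so the second helper only does a whole-word text search for the address and will not recognize an address that is covered by a listed subnet.

```shell
# preload_has_ssocks (hypothetical helper): succeeds if the given LD_PRELOAD
# value contains libksupersockets.so. Entries may be separated by spaces
# or colons and may carry a directory prefix.
preload_has_ssocks() {
    case " $(printf '%s' "$1" | tr ':' ' ') " in
        *"libksupersockets.so "*) return 0 ;;
        *) return 1 ;;
    esac
}

# ip_in_socket_maps (hypothetical helper): succeeds if the IP address appears
# as a whole word in the socket map file.
#   $1  IP address
#   $2  map file (default: /proc/net/af_ssocks/socket_maps)
# Returns 2 if the file cannot be read (for example, SuperSockets not loaded).
ip_in_socket_maps() (
    ip=$1
    maps=${2:-/proc/net/af_ssocks/socket_maps}
    [ -r "$maps" ] || return 2
    grep -qwF -- "$ip" "$maps"
)
```

A typical call would be `preload_has_ssocks "${LD_PRELOAD:-}" && ip_in_socket_maps 192.168.1.10`, where the address is a placeholder for the interface the benchmark actually uses.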
2.1.6. | I am running a mixed 32/64-bit platform, and while the benchmarks latency_bench and sockperf from the DIS installation show good performance of SuperSockets, other applications show only Ethernet performance for socket communication. |
| Please use the file command to verify whether the applications that fail to use SuperSockets are 32-bit applications. If they are, please verify that the 32-bit SuperSockets library can be found as /opt/DIS/lib/libksupersockets.so (this is a link). If this file is not found, it could not be built due to a missing or incomplete 32-bit compilation environment on your build machine. This problem is indicated by the SIA message #* WARNING: 32-bit applications may not be able to use SuperSockets. If 32-bit libraries cannot be built on a 64-bit platform, the RPM packages will still be built successfully (without 32-bit libraries included), as many users of 64-bit platforms only run native 64-bit applications. To fix this problem, make sure that the 32-bit versions of the glibc and libgcc-devel packages are installed on your build machine, and re-build the binary RPM packages using the SIA option --build-rpm, making sure that the warning message shown above does not appear. Then, replace the existing RPM package Dolphin-SuperSockets with the one you have just built. Alternatively, you can perform a complete re-installation. |
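The 32-bit check from 2.1.6 could be automated along these lines. The helper names `elf_bits` and `check_32bit_ssocks` are made up for this sketch; it relies on the usual `file -b` description starting with "ELF 32-bit" or "ELF 64-bit" and uses the library path given above.

```shell
# elf_bits (hypothetical helper): maps a `file -b` description to 32, 64 or
# unknown.
elf_bits() {
    case "$1" in
        "ELF 32-bit"*) echo 32 ;;
        "ELF 64-bit"*) echo 64 ;;
        *)             echo unknown ;;
    esac
}

# check_32bit_ssocks (hypothetical helper): warns if the application is a
# 32-bit binary but the 32-bit SuperSockets library link is missing.
check_32bit_ssocks() (
    desc=$(file -bL "$1") || return 2
    if [ "$(elf_bits "$desc")" = 32 ] && [ ! -e /opt/DIS/lib/libksupersockets.so ]; then
        echo "$1 is 32-bit, but /opt/DIS/lib/libksupersockets.so is missing"
        return 1
    fi
)
```

Running `check_32bit_ssocks /path/to/application` on a Cluster Node prints a warning only in the situation described above; the path is a placeholder.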
2.1.7. | I have added the statement export LD_PRELOAD=libksupersockets.so to my shell profile to enable the use of SuperSockets. This works well on some machines, but on other machines, I get the error message ERROR: ld.so object 'libksupersockets.so' from LD_PRELOAD cannot be preloaded : ignore whenever I log in. How can this be fixed? |
| This error message is generated on machines that do not have SuperSockets installed. On these machines, the dynamic linker cannot find the libksupersockets.so library. This can be fixed by setting the LD_PRELOAD environment variable only if SuperSockets are running. For an sh-type shell such as bash, use the following statement in the shell profile ($HOME/.bashrc):
[ -d /proc/net/af_ssocks ] && export LD_PRELOAD=libksupersockets.so
|
2.1.8. | How can I permanently enable the use of SuperSockets for a user? |
| This can be achieved by setting the LD_PRELOAD environment variable in the user's shell profile (e.g. $HOME/.bashrc for the bash shell). This should be done conditionally by checking whether SuperSockets are running on this machine:
[ -d /proc/net/af_ssocks ] && export LD_PRELOAD=libksupersockets.so
Of course, it is also possible to perform this setting globally (in /etc/profile). |
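If the profile may already set LD_PRELOAD for other libraries, the one-line statement from 2.1.7 and 2.1.8 would overwrite that value. The following sketch keeps existing entries and places libksupersockets.so first. The function name `ssocks_preload_value` and its optional second argument are assumptions for this example, not part of the Dolphin software.

```shell
# ssocks_preload_value (hypothetical helper): prints the LD_PRELOAD value to
# use, given the current value.
#   $1  current LD_PRELOAD value (may be empty)
#   $2  directory that indicates running SuperSockets
#       (default: /proc/net/af_ssocks)
ssocks_preload_value() (
    current=$1
    procdir=${2:-/proc/net/af_ssocks}
    lib=libksupersockets.so
    # SuperSockets not running on this machine: leave the value untouched.
    if [ ! -d "$procdir" ]; then
        printf '%s\n' "$current"
        return
    fi
    # LD_PRELOAD entries may be separated by spaces or colons.
    normalized=$(printf '%s' "$current" | tr ':' ' ')
    case " $normalized " in
        *" $lib "*) printf '%s\n' "$current" ;;        # already listed
        "  ")       printf '%s\n' "$lib" ;;            # nothing set so far
        *)          printf '%s\n' "$lib $current" ;;   # prepend to existing
    esac
)
```

In a shell profile this could be used as `LD_PRELOAD=$(ssocks_preload_value "${LD_PRELOAD:-}"); export LD_PRELOAD`.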
2.1.9. | I cannot build SISCI applications that are able to run on my cluster because the Cluster Management Node (where the SISCI-devel package was installed by the SIA) is a 32-bit machine, while my Cluster Nodes are 64-bit machines (or vice versa). I fail to build the SISCI applications on the Cluster Nodes as the SISCI header files are missing. How can this deadlock be solved? |
| When the SIA installed the cluster, it stored the binary RPM packages in the separate directories node_RPMS, frontend_RPMS and source_RPMS. You will find a SISCI-devel RPM that can be installed on the Cluster Nodes in the node_RPMS directory. If you cannot find this RPM file, you can re-create it on one of the Cluster Nodes using the SIA with the --build-rpm option. Once you have the Dolphin-SISCI-devel binary RPM, you may need to install it on the Cluster Nodes using the --force option of rpm, because the library files conflict between the installed SISCI and the SISCI-devel RPM:
# rpm -i --force Dolphin-SISCI-devel.<arch>.<version>.rpm
|
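Locating the package mentioned in 2.1.9 can be sketched as follows. The function name `find_sisci_devel_rpm` is hypothetical, and since this FAQ does not say where the SIA places the node_RPMS directory, the directory to search has to be passed in explicitly.

```shell
# find_sisci_devel_rpm (hypothetical helper): lists Dolphin-SISCI-devel RPM
# files found in any node_RPMS directory below the given directory.
#   $1  directory to search (where the SIA stored its RPM directories)
find_sisci_devel_rpm() {
    find "$1" -type f -path '*/node_RPMS/*' \
        -name 'Dolphin-SISCI-devel*.rpm' 2>/dev/null
}
```

Each path it prints can then be passed to the rpm command shown above.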