Process freezes when swap file is used

Asked by cwarner7_11

I have noticed that when running certain programs (typically heavy linear algebra applications, such as FEA on large meshes), once memory runs short and swapping begins, the program eventually goes to sleep and never wakes up. Since a run may take several hours, the only way I can tell whether swapping is occurring and whether the process has gone off to dreamland is to keep the process manager running alongside it. There are no error messages. System specifics:

Ubuntu 10.04, Kernel 2.6.32-31-generic (64 bit), Pentium Dual Core T4200 @ 2 GHz, L2 Cache 1024 KB, 3 GB Memory, 16 GB Swap

My question is, is this an issue with how I have Ubuntu set up, or is this an issue with the particular software application I am running?

Question information

Language: English
Status: Solved
For: Ubuntu
Solved by: Thomas Krüger

Thomas Krüger (thkrueger) said :
#1

Could you please post the output of
top -b -n 1 | head -n 20
while the process is sleeping?

cwarner7_11 (cwarner7-11) said :
#2

OK, Thomas, here is the output of "top -b -n 1 | head -n 20":

top - 00:25:00 up 16:20, 3 users, load average: 3.77, 3.66, 2.30
Tasks: 217 total, 2 running, 215 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.2%us, 1.9%sy, 11.3%ni, 79.8%id, 3.7%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2957240k total, 2931476k used, 25764k free, 2908k buffers
Swap: 16129816k total, 2793316k used, 13336500k free, 96908k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
25407 rocketma  20   0  328m 7000 3816 S    7  0.2   6:39.56 gnome-system-mo
 1118 root      20   0  169m  16m 6136 R    5  0.6  28:22.34 Xorg
 2409 rocketma  20   0  126m  18m 1316 S    4  0.6  18:46.87 beam.smp
26415 rocketma  20   0 4485m 1.8g 2052 D    2 64.7   1:30.32 SALOME_Session_
26912 rocketma  20   0 19356 1332  932 R    2  0.0   0:00.03 top
    1 root      20   0 23840 1424  920 S    0  0.0   0:00.71 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.01 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:00.09 migration/0
    4 root      20   0     0    0    0 S    0  0.0   0:00.79 ksoftirqd/0
    5 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    6 root      RT   0     0    0    0 S    0  0.0   0:00.09 migration/1
    7 root      20   0     0    0    0 S    0  0.0   0:03.11 ksoftirqd/1
    8 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/1

Note that when the event happened, I had some trouble getting a terminal to start so I could generate this output, and I had several applications running (I was intentionally trying to force the fault). The fault also happens occasionally when I am running only one application (a finite element analysis program, when the mesh is too fine). Additional information on memory usage at the time of the fault: 95% of 2.7 GB main memory in use, 2.8 GB of swap in use.
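For anyone reading the output above: the S column is the process state, and a D there marks uninterruptible sleep, which usually means the process is waiting on disk I/O. A minimal sketch for picking such processes out of batch-mode top output follows. The helper name list_blocked is made up for illustration, and the field positions assume the default column layout shown above.

```shell
#!/bin/sh
# list_blocked (hypothetical helper): read `top -b -n 1` output on stdin and
# print PID, VIRT, RES and COMMAND for every process in state D.
# Assumes the default top layout, where the state is the 8th column
# and the command name is the 12th.
list_blocked() {
    awk '
        $1 == "PID" { in_table = 1; next }   # process rows start after the header row
        in_table && $8 == "D" { print $1, $5, $6, $12 }
    '
}
```

It could be used as `top -b -n 1 | list_blocked`. Run against the output above, it would single out the SALOME_Session_ process.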

Best Thomas Krüger (thkrueger) said :
#3

It is easy to see that the process requires far more RAM than is available.
While swap space can help when memory is used up, it is primarily intended for moving the memory of sleeping processes to disk to free up RAM. If a process consumes huge amounts of RAM that is accessed all the time and also needs a lot of CPU time, swapping is very slow, and in some cases the system cannot free enough RAM even with swapping. This leads to suspended threads, high (blocking) I/O and an unresponsive system.
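One way to check whether the system is in this state is to watch how many pages move to and from swap over a short interval. The sketch below reads the pswpin and pswpout counters from /proc/vmstat. The function names are made up for illustration, and how much traffic counts as "too much" is a judgment call rather than a fixed threshold.

```shell
#!/bin/sh
# vmstat_counter NAME [FILE]: print the value of one counter from a file in
# /proc/vmstat format ("name value" per line). FILE defaults to /proc/vmstat.
vmstat_counter() {
    awk -v name="$1" '$1 == name { print $2 }' "${2:-/proc/vmstat}"
}

# swap_activity [SECONDS]: report how many pages were swapped in and out
# during the interval (default 5 seconds).
swap_activity() {
    interval="${1:-5}"
    in_before=$(vmstat_counter pswpin)
    out_before=$(vmstat_counter pswpout)
    sleep "$interval"
    in_after=$(vmstat_counter pswpin)
    out_after=$(vmstat_counter pswpout)
    echo "swapped in:  $((in_after - in_before)) pages in ${interval}s"
    echo "swapped out: $((out_after - out_before)) pages in ${interval}s"
}
```

Calling `swap_activity 10` a few times while the analysis runs gives a rough picture: steady traffic in both directions while the program makes little progress suggests the working set does not fit in RAM.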

But here are some hints for you:
- Install more RAM! Nothing can replace RAM, except even more RAM. ;-)
- Change the analysis script to use less memory. High-speed computing is mostly a compromise between CPU time and memory.
- Increase the process's niceness to about 10 (with the "nice" command, or in "top" with the r key). This will make the system more responsive.
- Check "dmesg" or the console output of the application for additional warnings and errors.
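The last two hints can be sketched as small shell helpers. The function names are made up for illustration, and the patterns in memory_warnings are only an assumption about which kernel messages are worth looking for (out-of-memory kills and hung-task reports). Other messages may matter too.

```shell
#!/bin/sh
# start_nice COMMAND [ARGS...]: start a job at niceness 10.
start_nice() {
    nice -n 10 "$@"
}

# renice_running PID: set the niceness of an already running process to 10.
renice_running() {
    renice -n 10 -p "$1"
}

# memory_warnings: filter kernel log lines on stdin, keeping those that
# commonly accompany memory pressure (assumed patterns, not an exhaustive list).
memory_warnings() {
    grep -i -E 'out of memory|oom-killer|blocked for more than'
}
```

For example, `start_nice ./run_analysis.sh` (a placeholder for however the solver is launched), `renice_running 26415` for the process shown in the top output, and `dmesg | memory_warnings` after a freeze.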

cwarner7_11 (cwarner7-11) said :
#4

Thank you very much - you have pretty much told me what the problem is. Install more RAM, or purchase a new computer? That is the decision point I am at right now. The new computer would have at least 4 cores instead of my current 2 - I don't know if that would help as much as more RAM in the current setup. When I specify two cores for the problem, it actually increases the CPU time for some reason (although other software runs nicely with both CPUs at better than 90% - that is a question for a different forum). My normal approach has been to change the problem (reducing the mesh resolution) until it runs OK. One issue is that it can take 2 to 24 hours to run a project when everything is working - it is rather frustrating to come back in the morning and find that the system has frozen up. I have not tried playing with niceness - that might improve the situation.
Normally the program runs from a terminal, and if something is wrong I get errors in the terminal, but not in this failure mode. I haven't watched dmesg - I will give that a try (although a lot of the time the "messages" are so obscure as to be meaningless!).
The upside of all this is that I am running analyses that are orders of magnitude larger than similar problems I worked on back in the 1980s on a supercomputer - punch cards, transmitted over a 300 baud modem, etc. And I am doing this on a laptop! One needs a historical perspective to appreciate that the problem has more to do with pushing the limits than with any particular fault in the system or the software...