What is the SREE scheduler doing?

From PTAGISWiki


When someone constructs a malformed or anti-social query with the querybuilder tool, the query can consume all available CPU on the machine hosting the scheduler process.

It would help to know which running query is the malformed one, so that feedback can be given to the user and we can decide whether to terminate the job.

This situation is often first reported by users who see no feedback from their attempts to run queries because the scheduler is too busy to answer them.

Right now reedi is hosting the scheduler and the load on reedi is high:

  2:30pm  up 78 day(s),  3:39,  2 users,  load average: 3.43, 3.18, 3.12
last pid:  4529;  load averages:  3.46,  3.21,  3.14                                                             14:30:59
194 processes: 184 sleeping, 3 running, 5 zombie, 2 on cpu
CPU states:  3.5% idle, 65.8% user, 30.6% kernel,  0.1% iowait,  0.0% swap
Memory: 2048M real, 81M free, 1862M swap in use, 1356M swap free

   PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 11518 root      78  30    0  491M  349M cpu/1  836:36 32.53% java
 11344 root      88  59    0  734M  410M run     37.6H  5.68% java

Note that CPU idle time is only 3.5%, well under 10%. The top process is java (PID 11518). The load has been high for some time: the 15-minute average reported by uptime is above 3.
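
The check described above can be scripted. This is only a sketch: the function names load15 and load_is_high are made up for this page, and the cutoff of 3 is just the value observed here, not an agreed threshold.

```shell
# Print the 15-minute load average from `uptime` output read on stdin.
# The three averages follow "load average: " and are comma-separated.
load15() {
    awk -F': ' '/load average/ { n = split($NF, a, /, */); print a[n] }'
}

# Exit 0 when the number read on stdin exceeds the limit given as $1.
# The limit is a hypothetical cutoff chosen by the caller.
load_is_high() {
    awk -v limit="$1" 'BEGIN { rc = 1 }
        { rc = !($1 + 0 > limit + 0) }
        END { exit rc }'
}

# Example: uptime | load15 | load_is_high 3 && echo "sustained high load"
```

The 15-minute figure is used rather than the 1-minute one because a short burst of activity is normal; it is the sustained load that suggests a runaway report.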

ps -eaf | grep java

shows that process 11518 is the scheduler: it can be recognized by the distinctive memory settings on its command line.
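
To make the memory settings easier to compare, something like the following may help. It assumes the settings in question are JVM heap options such as -Xms and -Xmx, which this page does not actually state. The function name java_heap_flags is hypothetical.

```shell
# Read `ps -eaf` output on stdin. For every java process, print its PID
# followed by any JVM heap options (-Xms..., -Xmx...) on its command line.
# Assumption: the scheduler differs from the other JVMs in these options.
java_heap_flags() {
    awk '$0 ~ /java/ && $0 !~ /grep/ {
        flags = ""
        for (i = 3; i <= NF; i++)
            if ($i ~ /^-Xm[sx]/) flags = flags " " $i
        print $2 flags
    }'
}

# Example: ps -eaf | java_heap_flags
```

If ps truncates long command lines on this platform, the flags may be cut off, in which case a java PID is printed with no options after it.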

truss -p 11518

shows that the scheduler is writing to /global/ds1/cache as it builds its reports. Simply checking the most recent files in /global/ds1/cache should give us a good idea of what reports are running. It would be helpful if these files were owned by ptagis user ids, but unfortunately they are all owned by root since the scheduler runs as root.

There are several types of files in this directory:

rufus.psmfc.org:C1:root: > ls -lt | head
total 6241622
-rw-r--r--   1 root     other        185 Jan  9 14:36 sptable1168382194143.dat
-rw-r--r--   1 root     other        159 Jan  9 14:36 sptable1168382194199.dat
-rw-r--r--   1 root     other      11476 Jan  9 14:35 storage_temp1168382114834.dat
-rw-r--r--   1 root     other    54083463 Jan  9 13:29 ptable1168377988382.dat
-rw-r--r--   1 root     other    3347728 Jan  9 13:29 ptable1168377988382.idx
-rw-r--r--   1 root     other    64958148 Jan  9 13:26 ptable1168377770251.dat
-rw-r--r--   1 root     other    4000004 Jan  9 13:26 ptable1168377770251.idx
-rw-r--r--   1 root     other    64923034 Jan  9 13:22 ptable1168377488660.dat
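
A rough way to see how much space each type of file is taking is to total the listing by file-name prefix. This is a sketch only; cache_summary is a made-up name, and it assumes the file names follow the prefix-plus-digits pattern shown above.

```shell
# Read `ls -l` output on stdin and print, for each kind of file,
# the number of files and their total size in bytes.
cache_summary() {
    awk 'NF >= 9 && $1 ~ /^-/ {
        kind = $NF
        gsub(/[0-9]+/, "", kind)    # ptable1168377988382.dat -> ptable.dat
        count[kind]++
        bytes[kind] += $5
    }
    END {
        for (k in count) printf "%s %d %.0f\n", k, count[k], bytes[k]
    }' | sort
}

# Example: ls -l /global/ds1/cache | cache_summary
```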

The *.dat files are data, the *.idx files are indexes. Peeking at a file can be done like this:

strings ptable1168377988382.dat | head

or

strings ptable1168377988382.dat | tail
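
Since the file names change with every report, it can be handy to peek at whichever ptable file is newest instead of typing the name. A possible sketch follows; the function names newest_ptable and peek_newest are invented here.

```shell
# Print the most recently modified ptable*.dat file in directory $1.
newest_ptable() {
    ls -t "$1"/ptable*.dat 2>/dev/null | head -1
}

# Show the first and last printable strings of that file.
peek_newest() {
    f=$(newest_ptable "$1")
    [ -n "$f" ] || { echo "no ptable files in $1" >&2; return 1; }
    echo "== $f"
    strings "$f" | head
    echo "..."
    strings "$f" | tail
}

# Example: peek_newest /global/ds1/cache
```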

Unfortunately, we don't get many clues from looking at the files that the scheduler has open:


reedi.psmfc.org:C1:root: > lsof -p 11518
COMMAND   PID USER   FD   TYPE        DEVICE   SIZE/OFF     NODE NAME
java    11518 root  cwd   VDIR        85,105      10752    67620 /usr/local/bea81/user_projects/mydomain2
java    11518 root  txt   VREG        85,105      36396    63654 /usr/local/bea81/jdk141_02/jre/bin/java
java    11518 root  txt   VREG        85,105     593260    63705 /usr/local/bea81/jdk141_02/jre/lib/sparc/libfontmanager.so
java    11518 root  txt   VREG        85,105     365240    63692 /usr/local/bea81/jdk141_02/jre/lib/sparc/libawt.so
java    11518 root  txt                                          unknown file system type (ufs), v_op: 0x78426120
java    11518 root  txt   VREG        85,105    1227013    56763 /usr/local/bea81/weblogic81/common/eval/pointbase/lib/pbserver44.jar
java    11518 root  txt   VREG        85,105     653661    56542 /usr/local/bea81/weblogic81/server/lib/ant/optional.jar
java    11518 root  txt   VREG        85,105   37058122    56711 /usr/local/bea81/weblogic81/server/lib/weblogic.jar
java    11518 root  txt   VREG        85,105     381604    63690 /usr/local/bea81/jdk141_02/jre/lib/sparc/libmlib_image.so
java    11518 root  txt   VREG        85,105    4928372    64278 /usr/local/bea81/jdk141_02/lib/tools.jar
java    11518 root  txt   VREG        85,105     721670    56540 /usr/local/bea81/weblogic81/server/lib/ant/ant.jar
java    11518 root  txt   VREG        85,105    1809245    56715 /usr/local/bea81/weblogic81/server/lib/webservices.jar
java    11518 root  txt                                          unknown file system type (ufs), v_op: 0x78426120
java    11518 root  txt                                          unknown file system type (ufs), v_op: 0x78426120
java    11518 root  txt   VREG        85,105    1187621    56666 /usr/local/bea81/weblogic81/server/lib/ojdbc14.jar
java    11518 root  txt   VREG        85,105     650627    63718 /usr/local/bea81/jdk141_02/jre/lib/ext/localedata.jar
java    11518 root  txt   VREG        85,105      57680    63685 /usr/local/bea81/jdk141_02/jre/lib/sparc/libnet.so
java    11518 root  txt   VREG        85,105     174008    63689 /usr/local/bea81/jdk141_02/jre/lib/sparc/libdcpr.so
java    11518 root  txt   VREG        85,105       2624    63708 /usr/local/bea81/jdk141_02/jre/lib/sparc/librmi.so
java    11518 root  txt   VREG        85,105     229972    63706 /usr/local (/dev/md/dsk/d105)
java    11518 root  txt                                          unknown file system type (ufs), v_op: 0x78426120
java    11518 root  txt   VREG        85,105      11068    63702 /usr/local/bea81/jdk141_02/jre/lib/sparc/headless/libmawt.so
java    11518 root  txt   VREG        85,105     251659    56762 /usr/local/bea81/weblogic81/common/eval/pointbase/lib/pbclient44.jar
java    11518 root    0u  FIFO 0x30024f29020        0t0 33437698 (fifofs) ->0x30024f29128
java    11518 root    1u  FIFO 0x30317cb2368        0t1 33437699 (fifofs) ->0x30317cb2260
java    11518 root    2u  FIFO 0x30317cb25a8        0t1 33437700 (fifofs) ->0x30317cb24a0
java    11518 root    3u  VCHR         13,12        0t0   220055 /devices/pseudo/mm@0:zero
java    11518 root    4r  DOOR         338,0        0t0       55 /var/run (swap) (door to nscd[303])
java    11518 root    5u  IPv4 0x39a12cb0338 0x230dbff4      TCP localhost:44737->localhost:21071 (ESTABLISHED)
java    11518 root    6w                                         unknown file system type (ufs), v_op: 0x78426120
java    11518 root    7u                                         unknown file system type (ufs), v_op: 0x78426120
java    11518 root    8w                                         unknown file system type (ufs), v_op: 0x78426120
java    11518 root    9w                                         unknown file system type (ufs), v_op: 0x78426120
java    11518 root   11u  IPv4 0x300026b4f40  0x685f693      TCP localhost:44731->localhost:21071 (ESTABLISHED)
java    11518 root   12u  IPv4 0x38ee1eff3a8   0t180106      TCP localhost:44741->localhost:21071 (ESTABLISHED)
java    11518 root   13u  IPv4 0x399d2ac6060  0xbe8cd39      TCP localhost:44745->localhost:21071 (ESTABLISHED)
java    11518 root   14u                                         unknown file system type (ufs), v_op: 0x78426120
java    11518 root   15w                                         unknown file system type (ufs), v_op: 0x78426120
java    11518 root   16u                                         unknown file system type (ufs), v_op: 0x78426120
java    11518 root   17u                                         unknown file system type (ufs), v_op: 0x78426120
java    11518 root   18w                                         unknown file system type (ufs), v_op: 0x78426120
java    11518 root   20u  IPv4 0x312a0359270   0t327244      TCP reedi.psmfc.org:38068->ldapcluster.psmfc.org:ldap (ESTABLISHED)
java    11518 root   22u  IPv4 0x3000b8f0650   0t147303      TCP localhost:37478->localhost:21071 (ESTABLISHED)
java    11518 root   23w                                         unknown file system type (ufs), v_op: 0x78426120
java    11518 root   24u  IPv4 0x3000b8e90d0 0t55625445      TCP localhost:55665->localhost:21071 (ESTABLISHED)
java    11518 root   25w                                         unknown file system type (ufs), v_op: 0x78426120
java    11518 root   28u  IPv4 0x31599c69270   0t146661      TCP localhost:37483->localhost:21071 (ESTABLISHED)
java    11518 root   30w                                         unknown file system type (ufs), v_op: 0x78426120
java    11518 root   31r  VREG        85,103     137484   259189 /usr (/dev/md/dsk/d103)
java    11518 root   32r  VREG        85,103     131008   259190 /usr (/dev/md/dsk/d103)
java    11518 root   33r  VREG        85,103      75144   591367 /usr -- LucidaBrightDemiBold.ttf
java    11518 root   34r  VREG        85,103      75124   591368 /usr -- LucidaBrightDemiItalic.ttf
java    11518 root   35r  VREG        85,103      80856   591369 /usr -- LucidaBrightItalic.ttf
java    11518 root   36r  VREG        85,103     344908   591738 /usr (/dev/md/dsk/d103)
java    11518 root   37r  VREG        85,103     208628   591370 /usr (/dev/md/dsk/d103)
java    11518 root   38r  VREG        85,103     698236   591371 /usr -- LucidaSansRegular.ttf
java    11518 root   39r  VREG        85,103     141272   591372 /usr (/dev/md/dsk/d103)
java    11518 root   40r  VREG        85,103     242700   591739 /usr (/dev/md/dsk/d103)
java    11518 root   41u  IPv4 0x30304088670        0t0      TCP *:44596 (LISTEN)
java    11518 root   42u  IPv4 0x3030320d0e0        0t0      TCP localhost:44594->localhost:44593 (BOUND)
java    11518 root   43w                                         unknown file system type (ufs), v_op: 0x78426120
java    11518 root   44u  IPv4 0x3000b1c50b0        0t0      TCP *:1111 (LISTEN)
java    11518 root   45w                                         unknown file system type (ufs), v_op: 0x78426120
java    11518 root   46u  IPv4 0x312a03413e8 0x583332c4      TCP localhost:37472->localhost:21071 (ESTABLISHED)
reedi.psmfc.org:C1:root: >

All the interesting bits are marked as "unknown file system type (ufs)". Bummer.
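
To get a quick feel for how much of the lsof output is unusable, the listing can be boiled down to a few counts. This is a sketch with a made-up function name, lsof_summary. It only counts the unresolved entries and tallies established TCP connections by remote end.

```shell
# Read `lsof -p PID` output on stdin and print:
#   unresolved N    entries lsof could not name
#   tcp PEER N      established TCP connections per remote endpoint
lsof_summary() {
    awk '
        /unknown file system type/ { unknown++; next }
        / TCP / && /ESTABLISHED/ {
            split($(NF - 1), ends, "->")    # local->remote
            peer[ends[2]]++
        }
        END {
            printf "unresolved %d\n", unknown
            for (p in peer) printf "tcp %s %d\n", p, peer[p]
        }' | sort
}

# Example: lsof -p 11518 | lsof_summary
```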

Who is doing what?

The goal is to match the data in the cache to a user and a query so that we can tell whether something was done incorrectly.

Successfully completed reports are in /global/ds1/pitweb-1.0/cvswd. Seeing what has changed recently there tells us who is getting things done:

rufus.psmfc.org:C1:root: > ls -lt | head
total 924
drwxr-xr-x   2 root     other       3584 Jan  9 13:51 fmonzyk
drwxrwxr-x   2 root     other      12288 Jan  9 13:36 biolines
drwxr-xr-x   2 root     other       2560 Jan  9 12:34 rhino06
drwxr-xr-x   2 root     other        512 Jan  9 12:03 Bummer
drwxrwxr-x   2 root     other        512 Jan  9 11:34 morrisj
drwxr-xr-x   2 root     other       2560 Jan  9 11:14 steveb
drwxr-xr-x   2 root     other       1024 Jan  9 11:10 rday
drwxr-xr-x   3 root     other       1024 Jan  9 11:03 gasvoda
drwxr-xr-x   2 root     other       1024 Jan  9 10:59 brianm

But these are reports that have already finished, not reports that are in progress. Still, it may be a clue about which users are currently active.
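
To pull just the user names for a given day out of that listing, a small filter along these lines may do. The name report_dirs_on is hypothetical, and it assumes the month and day appear in the sixth and seventh columns as in the listing above.

```shell
# Read `ls -lt` output on stdin and print the names of directories
# whose modification date matches month $1 and day $2 (e.g. Jan 9).
report_dirs_on() {
    awk -v m="$1" -v d="$2" \
        '$1 ~ /^d/ && $6 == m && $7 + 0 == d + 0 { print $NF }'
}

# Example: ls -lt /global/ds1/pitweb-1.0/cvswd | report_dirs_on Jan 9
```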

Interpret the partial data

One good way to identify the owner of a report is to look for tag IDs in the partial data and see who the coordinator for those tags is. This may correlate with the most active users as determined above.
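
One possible way to fish tag IDs out of a partial data file is sketched below. It assumes tag IDs look like three hex digits, a dot, and ten hex digits (for example 3D9.1BF1234567); that pattern is an assumption and is not taken from this page, so adjust it to whatever the data actually shows. The function name tag_ids is made up, and the -o option may not be available in every grep.

```shell
# Read text on stdin and print each distinct tag ID with the number
# of times it occurs, most frequent first.
# Assumption: IDs match XXX.XXXXXXXXXX in upper-case hex.
tag_ids() {
    grep -Eo '[0-9A-F]{3}\.[0-9A-F]{10}' | sort | uniq -c | sort -rn
}

# Example: strings ptable1168377988382.dat | tag_ids | head
```

The most frequent prefixes in the result are the ones worth looking up against the list of tag coordinators.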
