What is SREE scheduler doing?
From PTAGISWiki
When someone constructs a malformed or anti-social query with the querybuilder tool, the query can consume nearly all available CPU on the machine hosting the scheduler process.
It would help to identify which malformed query is running, so that we can give the user feedback and decide whether to terminate the job.
This situation is often first reported by users who see no feedback from their attempts to run queries because the scheduler is too busy to answer them.
Right now reedi is hosting the scheduler and the load on reedi is high:
2:30pm up 78 day(s), 3:39, 2 users, load average: 3.43, 3.18, 3.12
last pid:  4529;  load averages:  3.46,  3.21,  3.14    14:30:59
194 processes: 184 sleeping, 3 running, 5 zombie, 2 on cpu
CPU states:  3.5% idle, 65.8% user, 30.6% kernel,  0.1% iowait,  0.0% swap
Memory: 2048M real, 81M free, 1862M swap in use, 1356M swap free

  PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
11518 root      78  30    0  491M  349M cpu/1  836:36 32.53% java
11344 root      88  59    0  734M  410M run     37.6H  5.68% java
Note that idle time stays between 0 and 10% (3.5% in this sample). The busiest process is java. The load has been high for some time, since the 15-minute average reported by uptime is above 3.
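The same check can be scripted. The sketch below, written for a POSIX shell, pulls the 15-minute figure out of a line of uptime output and compares it with a threshold. The function names load15 and load_is_high and the threshold of 3 are illustrative choices, not part of any existing tooling.

```shell
# Print the 15-minute load average (the third figure) from uptime output on stdin.
load15() {
    sed 's/.*load average[s]*: *//' | cut -d, -f3 | tr -d ' '
}

# Succeed (exit status 0) when the 15-minute load read from stdin exceeds
# the threshold given as the first argument.
load_is_high() {
    l=$(load15)
    echo "$l $1" | awk '{ exit !($1 > $2) }'
}

# Example: uptime | load_is_high 3 && echo "load has been high for a while"
```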
ps -eaf | grep java
shows that process 11518 is the scheduler: the memory settings on its command line are unlike those of any other java process on the host.
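To narrow the ps listing to one candidate, the output can be filtered on the memory option that sets the scheduler apart. The sketch below assumes that option is a JVM heap flag. The actual command line is not reproduced on this page, so -Xmx512m in the usage line is a placeholder, and java_pids_with is a hypothetical helper name. Writing the pattern as [j]ava keeps the grep process itself out of a live ps listing.

```shell
# Read `ps -eaf` output on stdin and print the PID (second column) of every
# java process whose command line contains the literal option given as $1.
java_pids_with() {
    grep '[j]ava' | grep -F -e "$1" | awk '{ print $2 }'
}

# Example (placeholder flag, substitute the scheduler's real setting):
#   ps -eaf | java_pids_with -Xmx512m
```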
truss -p 11518
shows that the scheduler is writing to /global/ds1/cache as it builds its reports. Simply checking the most recent files in /global/ds1/cache should give us a good idea of what reports are running. It would be helpful if these files were owned by ptagis user ids, but unfortunately they are all owned by root since the scheduler runs as root.
There are several types of files in this directory:
rufus.psmfc.org:C1:root: > ls -lt | head
total 6241622
-rw-r--r--   1 root     other         185 Jan  9 14:36 sptable1168382194143.dat
-rw-r--r--   1 root     other         159 Jan  9 14:36 sptable1168382194199.dat
-rw-r--r--   1 root     other       11476 Jan  9 14:35 storage_temp1168382114834.dat
-rw-r--r--   1 root     other    54083463 Jan  9 13:29 ptable1168377988382.dat
-rw-r--r--   1 root     other     3347728 Jan  9 13:29 ptable1168377988382.idx
-rw-r--r--   1 root     other    64958148 Jan  9 13:26 ptable1168377770251.dat
-rw-r--r--   1 root     other     4000004 Jan  9 13:26 ptable1168377770251.idx
-rw-r--r--   1 root     other    64923034 Jan  9 13:22 ptable1168377488660.dat
The *.dat files are data, the *.idx files are indexes. Peeking at a file can be done like this:
strings ptable1168377988382.dat | head
or
strings ptable1168377988382.dat | tail
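The file names themselves may carry a clue. The 13-digit number in each name looks like a Java-style timestamp in milliseconds since 1970. This is an inference from the listing above, not something confirmed about the scheduler: decoded, and assuming the host keeps Pacific time, each number falls at or a few minutes before the file's modification time. If that reading is right, the gap between the two approximates how long the scheduler spent writing the file. A sketch, with the hypothetical helper name cache_epoch:

```shell
# Print the epoch time in seconds embedded in a cache file name given as $1,
# e.g. ptable1168377988382.dat -> 1168377988. The last three digits
# (the milliseconds) are dropped; any leading directory is ignored.
cache_epoch() {
    basename "$1" | sed 's/^[^0-9]*\([0-9]*\)[0-9][0-9][0-9]\..*$/\1/'
}

# Example: convert the result to a local date with perl, which avoids
# relying on GNU date:
#   perl -le 'print scalar(localtime($ARGV[0]))' "$(cache_epoch ptable1168377988382.dat)"
```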
Unfortunately, we don't get many clues from looking at the files that the scheduler has open:
reedi.psmfc.org:C1:root: > lsof -p 11518
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 11518 root cwd VDIR 85,105 10752 67620 /usr/local/bea81/user_projects/mydomain2
java 11518 root txt VREG 85,105 36396 63654 /usr/local/bea81/jdk141_02/jre/bin/java
java 11518 root txt VREG 85,105 593260 63705 /usr/local/bea81/jdk141_02/jre/lib/sparc/libfontmanager.so
java 11518 root txt VREG 85,105 365240 63692 /usr/local/bea81/jdk141_02/jre/lib/sparc/libawt.so
java 11518 root txt unknown file system type (ufs), v_op: 0x78426120
java 11518 root txt VREG 85,105 1227013 56763 /usr/local/bea81/weblogic81/common/eval/pointbase/lib/pbserver44.jar
java 11518 root txt VREG 85,105 653661 56542 /usr/local/bea81/weblogic81/server/lib/ant/optional.jar
java 11518 root txt VREG 85,105 37058122 56711 /usr/local/bea81/weblogic81/server/lib/weblogic.jar
java 11518 root txt VREG 85,105 381604 63690 /usr/local/bea81/jdk141_02/jre/lib/sparc/libmlib_image.so
java 11518 root txt VREG 85,105 4928372 64278 /usr/local/bea81/jdk141_02/lib/tools.jar
java 11518 root txt VREG 85,105 721670 56540 /usr/local/bea81/weblogic81/server/lib/ant/ant.jar
java 11518 root txt VREG 85,105 1809245 56715 /usr/local/bea81/weblogic81/server/lib/webservices.jar
java 11518 root txt unknown file system type (ufs), v_op: 0x78426120
java 11518 root txt unknown file system type (ufs), v_op: 0x78426120
java 11518 root txt VREG 85,105 1187621 56666 /usr/local/bea81/weblogic81/server/lib/ojdbc14.jar
java 11518 root txt VREG 85,105 650627 63718 /usr/local/bea81/jdk141_02/jre/lib/ext/localedata.jar
java 11518 root txt VREG 85,105 57680 63685 /usr/local/bea81/jdk141_02/jre/lib/sparc/libnet.so
java 11518 root txt VREG 85,105 174008 63689 /usr/local/bea81/jdk141_02/jre/lib/sparc/libdcpr.so
java 11518 root txt VREG 85,105 2624 63708 /usr/local/bea81/jdk141_02/jre/lib/sparc/librmi.so
java 11518 root txt VREG 85,105 229972 63706 /usr/local (/dev/md/dsk/d105)
java 11518 root txt unknown file system type (ufs), v_op: 0x78426120
java 11518 root txt VREG 85,105 11068 63702 /usr/local/bea81/jdk141_02/jre/lib/sparc/headless/libmawt.so
java 11518 root txt VREG 85,105 251659 56762 /usr/local/bea81/weblogic81/common/eval/pointbase/lib/pbclient44.jar
java 11518 root 0u FIFO 0x30024f29020 0t0 33437698 (fifofs) ->0x30024f29128
java 11518 root 1u FIFO 0x30317cb2368 0t1 33437699 (fifofs) ->0x30317cb2260
java 11518 root 2u FIFO 0x30317cb25a8 0t1 33437700 (fifofs) ->0x30317cb24a0
java 11518 root 3u VCHR 13,12 0t0 220055 /devices/pseudo/mm@0:zero
java 11518 root 4r DOOR 338,0 0t0 55 /var/run (swap) (door to nscd[303])
java 11518 root 5u IPv4 0x39a12cb0338 0x230dbff4 TCP localhost:44737->localhost:21071 (ESTABLISHED)
java 11518 root 6w unknown file system type (ufs), v_op: 0x78426120
java 11518 root 7u unknown file system type (ufs), v_op: 0x78426120
java 11518 root 8w unknown file system type (ufs), v_op: 0x78426120
java 11518 root 9w unknown file system type (ufs), v_op: 0x78426120
java 11518 root 11u IPv4 0x300026b4f40 0x685f693 TCP localhost:44731->localhost:21071 (ESTABLISHED)
java 11518 root 12u IPv4 0x38ee1eff3a8 0t180106 TCP localhost:44741->localhost:21071 (ESTABLISHED)
java 11518 root 13u IPv4 0x399d2ac6060 0xbe8cd39 TCP localhost:44745->localhost:21071 (ESTABLISHED)
java 11518 root 14u unknown file system type (ufs), v_op: 0x78426120
java 11518 root 15w unknown file system type (ufs), v_op: 0x78426120
java 11518 root 16u unknown file system type (ufs), v_op: 0x78426120
java 11518 root 17u unknown file system type (ufs), v_op: 0x78426120
java 11518 root 18w unknown file system type (ufs), v_op: 0x78426120
java 11518 root 20u IPv4 0x312a0359270 0t327244 TCP reedi.psmfc.org:38068->ldapcluster.psmfc.org:ldap (ESTABLISHED)
java 11518 root 22u IPv4 0x3000b8f0650 0t147303 TCP localhost:37478->localhost:21071 (ESTABLISHED)
java 11518 root 23w unknown file system type (ufs), v_op: 0x78426120
java 11518 root 24u IPv4 0x3000b8e90d0 0t55625445 TCP localhost:55665->localhost:21071 (ESTABLISHED)
java 11518 root 25w unknown file system type (ufs), v_op: 0x78426120
java 11518 root 28u IPv4 0x31599c69270 0t146661 TCP localhost:37483->localhost:21071 (ESTABLISHED)
java 11518 root 30w unknown file system type (ufs), v_op: 0x78426120
java 11518 root 31r VREG 85,103 137484 259189 /usr (/dev/md/dsk/d103)
java 11518 root 32r VREG 85,103 131008 259190 /usr (/dev/md/dsk/d103)
java 11518 root 33r VREG 85,103 75144 591367 /usr -- LucidaBrightDemiBold.ttf
java 11518 root 34r VREG 85,103 75124 591368 /usr -- LucidaBrightDemiItalic.ttf
java 11518 root 35r VREG 85,103 80856 591369 /usr -- LucidaBrightItalic.ttf
java 11518 root 36r VREG 85,103 344908 591738 /usr (/dev/md/dsk/d103)
java 11518 root 37r VREG 85,103 208628 591370 /usr (/dev/md/dsk/d103)
java 11518 root 38r VREG 85,103 698236 591371 /usr -- LucidaSansRegular.ttf
java 11518 root 39r VREG 85,103 141272 591372 /usr (/dev/md/dsk/d103)
java 11518 root 40r VREG 85,103 242700 591739 /usr (/dev/md/dsk/d103)
java 11518 root 41u IPv4 0x30304088670 0t0 TCP *:44596 (LISTEN)
java 11518 root 42u IPv4 0x3030320d0e0 0t0 TCP localhost:44594->localhost:44593 (BOUND)
java 11518 root 43w unknown file system type (ufs), v_op: 0x78426120
java 11518 root 44u IPv4 0x3000b1c50b0 0t0 TCP *:1111 (LISTEN)
java 11518 root 45w unknown file system type (ufs), v_op: 0x78426120
java 11518 root 46u IPv4 0x312a03413e8 0x583332c4 TCP localhost:37472->localhost:21071 (ESTABLISHED)
reedi.psmfc.org:C1:root: >
All the interesting bits are marked as "unknown file system type (ufs)". Bummer.
Who is doing what?
The goal is to match this dataset to a user and a query so that we can understand if something was done incorrectly.
Successfully completed reports are in /global/ds1/pitweb-1.0/cvswd. Seeing what has changed recently there tells us who is getting things done:
rufus.psmfc.org:C1:root: > ls -lt | head
total 924
drwxr-xr-x   2 root     other       3584 Jan  9 13:51 fmonzyk
drwxrwxr-x   2 root     other      12288 Jan  9 13:36 biolines
drwxr-xr-x   2 root     other       2560 Jan  9 12:34 rhino06
drwxr-xr-x   2 root     other        512 Jan  9 12:03 Bummer
drwxrwxr-x   2 root     other        512 Jan  9 11:34 morrisj
drwxr-xr-x   2 root     other       2560 Jan  9 11:14 steveb
drwxr-xr-x   2 root     other       1024 Jan  9 11:10 rday
drwxr-xr-x   3 root     other       1024 Jan  9 11:03 gasvoda
drwxr-xr-x   2 root     other       1024 Jan  9 10:59 brianm
But these are reports that have already finished, not reports that are in progress. Still, it may be a clue about which users are currently active.
Interpret the partial data
One good clue to the owner of a report is the tag IDs in its data: look them up and see who the coordinator for those tags is. This may correlate with the most active users as determined above.
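As a sketch of that search, the filter below counts tag-code-like strings in text read from standard input and lists them, most frequent first. It assumes tag IDs appear in the cache files as readable text in the form of three hexadecimal digits, a dot, and ten more hexadecimal digits (for example 3D9.1BF1234567, a made-up code). Adjust the pattern if the files store them differently. The name tag_ids is a hypothetical helper.

```shell
# Count tag-code-like tokens on stdin and list them, most frequent first.
# Assumed format: HHH.HHHHHHHHHH with upper-case hexadecimal digits.
tag_ids() {
    # Break the input into runs of hex digits and dots, one run per line,
    # then keep only the runs that match the assumed tag-code shape.
    LC_ALL=C tr -cs '0-9A-F.' '\n' |
        LC_ALL=C grep '^[0-9A-F]\{3\}\.[0-9A-F]\{10\}$' |
        sort | uniq -c | sort -rn
}

# Example: strings ptable1168377988382.dat | tag_ids | head
```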
