Replicate stored procedure crash on sockeye
From PTAGISWiki
We experienced an ingres crash each of the four times we tried to load tagging file CFM09188.TUM on Sep 16, 2009.
The errlog.log indicated each time that the stored procedure Fd_aggre_mort_data_cdbp was the culprit.
I opened a ticket with Ingres and they suspect it is related to a known bug (119951) and want to reproduce the problem in house to verify the fix and have asked for the copy.out.
I'd like to replicate the problem on development first, to give me confidence that we understand the problem.
Dave has suggested that I submit the tag file CFM09197.PRD and then manually insert two orphan records into tag_data.
I found the file on blueback at /home/pittag/data/FDVL/loaded/tagging/CFM/2009
I removed the "-L_20" suffix and emailed it to pitdata@sockeye.psmfc.org with the subject "tagging"
The file was received, validated, and loaded.
I didn't insert the orphans yet.
I submitted the poison file to sockeye and it loaded without any crash.
Doug called.
Contents |
Doug's advice
- compare CBF parameters on blueback and sockeye
- stack size, cache size? (results of comparing config.dat on blueback and sockeye)
- bring sockeye down to blueback levels if sockeye has more than blueback
- a snapshot of the database could be copied to sockeye, but it is a lot of work and it may squash current development on sockeye
- a copy.out of just a few tables from blueback to sockeye could work but it may not capture enough of the state to trigger the bug and it involves the exercise of following all the foreign key dependencies to make sure that everything is internally consistent. So it may be much more than just a few tables that need to be copied.
- considering the Ingres has a hunch about a bug and has just asked for the copy.out of the relevant tables, I should probably do so that they can be working on the problem while I try to replicate it on sockeye.
generating copy.out of relevant tables
What are the relevant tables?
The stored procedure operates directly on these tables:
- tag_hdr
- tag_data
- mort_hdr
- mort_data
For each table, the procedure looks like this:
copydb -upittag ptagis3 <table> sql -upittag ptagis3 < copy.out mv copy.out <table>.copy.out mv copy.in <table>.copy.in
That exercise left me with these files to somehow transfer to Ingres:
-rw-r--r-- 1 root root 5.3G Sep 17 11:14 tag_data.pittag -rw-r--r-- 1 root root 244 Sep 17 11:01 tag_data.copy.out -rw-r--r-- 1 root root 3.1K Sep 17 11:01 tag_data.copy.in -rw-r--r-- 1 root root 237M Sep 17 10:59 mort_data.pittag -rw-r--r-- 1 root root 246 Sep 17 10:59 mort_data.copy.out -rw-r--r-- 1 root root 3.1K Sep 17 10:59 mort_data.copy.in -rw-r--r-- 1 root root 1.1M Sep 17 10:58 mort_hdr.pittag -rw-r--r-- 1 root root 244 Sep 17 10:57 mort_hdr.copy.out -rw-r--r-- 1 root root 2.5K Sep 17 10:57 mort_hdr.copy.in -rw-r--r-- 1 root root 56M Sep 17 10:57 tag_hdr.pittag -rw-r--r-- 1 rday pitadmin 242 Sep 17 10:55 tag_hdr.copy.out -rw-r--r-- 1 rday pitadmin 3.6K Sep 17 10:55 tag_hdr.copy.in
I haven't tried to trace the dependencies to other tables yet.
continuing with crash replication on sockeye
I'm waiting to hear from Ingres about how I should send the 5.5GB of data they requested.
The next step on sockeye is to make mort_data conform to the state on blueback. I'm using this sql:
select m_file, tag_file, attmpt_seq, m_tagid from mort_data where m_file='CFM09188.TUM' order by tag_file
It produces this output on blueback:
Executing . . . ┌───────────────┬───────────────┬─────────────┬───────────────┐ │m_file │tag_file │attmpt_seq │m_tagid │ ├───────────────┼───────────────┼─────────────┼───────────────┤ │CFM09188.TUM │CFM09197.PRD │ 9│3D9.1C2D3402E6 │ │CFM09188.TUM │CFM09197.PRD │ 9│3D9.1C2D2E010B │ │CFM09188.TUM │ORPHAN │ 9│3D9.1C2D04EB88 │ │CFM09188.TUM │ORPHAN │ 9│3D9.1C2C484015 │ │CFM09188.TUM │ │ 9│3D9.1C2D12DD60 │ │CFM09188.TUM │ │ 9│3D9.1C2D10ABD4 │ │CFM09188.TUM │ │ 9│3D9.1C2D0D9197 │ │CFM09188.TUM │ │ 9│3D9.1C2D04E2CE │
The query on both machines return 61 rows.
On sockeye, the two orphan records had blank tag_files, so I manually updated them to be orphans. Now the output on sockeye looks like this:
Executing . . . ┌───────────────┬───────────────┬─────────────┬───────────────┐ │m_file │tag_file │attmpt_seq │m_tagid │ ├───────────────┼───────────────┼─────────────┼───────────────┤ │CFM09188.TUM │CFM09197.PRD │ 4│3D9.1C2D3402E6 │ │CFM09188.TUM │CFM09197.PRD │ 4│3D9.1C2D2E010B │ │CFM09188.TUM │ORPHAN │ 4│3D9.1C2D04EB88 │ │CFM09188.TUM │ORPHAN │ 4│3D9.1C2C484015 │ │CFM09188.TUM │ │ 4│3D9.1C2D12DD60 │ │CFM09188.TUM │ │ 4│3D9.1C2D10ABD4 │ │CFM09188.TUM │ │ 4│3D9.1C2D0D9197 │ │CFM09188.TUM │ │ 4│3D9.1C2D04E2CE │
Next I submitted the poison file via email to sockeye and watched the Ingres errlog.log...
It loaded with no errors:
Your TAGGING file CFM09188.TUM, which has been submitted 5 times, has been loaded to the ptagis3 database, updating content of the previous submittal. Details follow: For this file the database now contains: 3067 Records in the tag_data table 61 Records in the mort_data table 156 Records in the recap_data table 3 Records in the tag_dot_out_data table
Next I manually inserted orphan records into tag_data on sockeye for the two orphans.
Then I resubmitted the poison file.
The file loaded with no errors.
Replicate state at application level
I worked with Dave to determine the list of all tags referenced in the poison tagging file. Given that list, Dave created an interrogation file that generated obs_data records for them all. Dave also made sure that we had the same orphans and dupes in sockeye that we had on blueback.
Dave then loaded each of the tagging files from TUM in order up to the poison file. Then we loaded the poison file and it did not crash sockeye.
Moving on
At this point, we will not spend further energy trying to replicate the problem on sockeye. I'll proceed by installing the latest SP and patches on sockeye and then on blueback. Then we'll try loading the poison file on production and hope that it does not crash.
See Plan for installing Ingres 2006 Release 2 SP2 on sockeye
Resolution
Ingres 2006 Release 2 SP2 and the latest patches were installed on blueback and we loaded the poison file to production without any adverse effects. Since the problem was intermittent, we are not certain that we have fixed the problem.
Dave modified /home/pittag/bin/FDVL/fdq_storeAggregationResults.pl to provide more logging if the problem should ever recur. The errors are written to MLOG at /tmp/fdq_storeAggregationResults.log.$$.
