Replicate stored procedure crash on sockeye

From PTAGISWiki

Jump to: navigation, search

We experienced an ingres crash each of the four times we tried to load tagging file CFM09188.TUM on Sep 16, 2009.

The errlog.log indicated each time that the stored procedure Fd_aggre_mort_data_cdbp was the culprit.

I opened a ticket with Ingres and they suspect it is related to a known bug (119951) and want to reproduce the problem in house to verify the fix and have asked for the copy.out.

I'd like to replicate the problem on development first, to give me confidence that we understand the problem.

Dave has suggested that I submit the tag file CFM09197.PRD and then manually insert two orphan records into tag_data.

I found the file on blueback at /home/pittag/data/FDVL/loaded/tagging/CFM/2009

I removed the "-L_20" suffix and emailed it to pitdata@sockeye.psmfc.org with the subject "tagging"

The file was received, validated, and loaded.

I didn't insert the orphans yet.

I submitted the poison file to sockeye and it loaded without any crash.

Doug called.

Contents

Doug's advice

  • compare CBF parameters on blueback and sockeye
  • a snapshot of the database could be copied to sockeye, but it is a lot of work and it may squash current development on sockeye
  • a copy.out of just a few tables from blueback to sockeye could work but it may not capture enough of the state to trigger the bug and it involves the exercise of following all the foreign key dependencies to make sure that everything is internally consistent. So it may be much more than just a few tables that need to be copied.
  • considering the Ingres has a hunch about a bug and has just asked for the copy.out of the relevant tables, I should probably do so that they can be working on the problem while I try to replicate it on sockeye.

generating copy.out of relevant tables

What are the relevant tables?

The stored procedure operates directly on these tables:

  • tag_hdr
  • tag_data
  • mort_hdr
  • mort_data

For each table, the procedure looks like this:

 copydb -upittag ptagis3 <table>
 sql -upittag ptagis3 < copy.out
 mv copy.out <table>.copy.out
 mv copy.in <table>.copy.in

That exercise left me with these files to somehow transfer to Ingres:

-rw-r--r--   1 root     root        5.3G Sep 17 11:14 tag_data.pittag
-rw-r--r--   1 root     root         244 Sep 17 11:01 tag_data.copy.out
-rw-r--r--   1 root     root        3.1K Sep 17 11:01 tag_data.copy.in
-rw-r--r--   1 root     root        237M Sep 17 10:59 mort_data.pittag
-rw-r--r--   1 root     root         246 Sep 17 10:59 mort_data.copy.out
-rw-r--r--   1 root     root        3.1K Sep 17 10:59 mort_data.copy.in
-rw-r--r--   1 root     root        1.1M Sep 17 10:58 mort_hdr.pittag
-rw-r--r--   1 root     root         244 Sep 17 10:57 mort_hdr.copy.out
-rw-r--r--   1 root     root        2.5K Sep 17 10:57 mort_hdr.copy.in
-rw-r--r--   1 root     root         56M Sep 17 10:57 tag_hdr.pittag
-rw-r--r--   1 rday     pitadmin     242 Sep 17 10:55 tag_hdr.copy.out
-rw-r--r--   1 rday     pitadmin    3.6K Sep 17 10:55 tag_hdr.copy.in

I haven't tried to trace the dependencies to other tables yet.

continuing with crash replication on sockeye

I'm waiting to hear from Ingres about how I should send the 5.5GB of data they requested.

The next step on sockeye is to make mort_data conform to the state on blueback. I'm using this sql:

select m_file, tag_file, attmpt_seq, m_tagid from mort_data where m_file='CFM09188.TUM' order by tag_file

It produces this output on blueback:

Executing . . .


┌───────────────┬───────────────┬─────────────┬───────────────┐
│m_file         │tag_file       │attmpt_seq   │m_tagid        │
├───────────────┼───────────────┼─────────────┼───────────────┤
│CFM09188.TUM   │CFM09197.PRD   │            9│3D9.1C2D3402E6 │
│CFM09188.TUM   │CFM09197.PRD   │            9│3D9.1C2D2E010B │
│CFM09188.TUM   │ORPHAN         │            9│3D9.1C2D04EB88 │
│CFM09188.TUM   │ORPHAN         │            9│3D9.1C2C484015 │
│CFM09188.TUM   │               │            9│3D9.1C2D12DD60 │
│CFM09188.TUM   │               │            9│3D9.1C2D10ABD4 │
│CFM09188.TUM   │               │            9│3D9.1C2D0D9197 │
│CFM09188.TUM   │               │            9│3D9.1C2D04E2CE │

The query on both machines return 61 rows.

On sockeye, the two orphan records had blank tag_files, so I manually updated them to be orphans. Now the output on sockeye looks like this:

Executing . . .


┌───────────────┬───────────────┬─────────────┬───────────────┐
│m_file         │tag_file       │attmpt_seq   │m_tagid        │
├───────────────┼───────────────┼─────────────┼───────────────┤
│CFM09188.TUM   │CFM09197.PRD   │            4│3D9.1C2D3402E6 │
│CFM09188.TUM   │CFM09197.PRD   │            4│3D9.1C2D2E010B │
│CFM09188.TUM   │ORPHAN         │            4│3D9.1C2D04EB88 │
│CFM09188.TUM   │ORPHAN         │            4│3D9.1C2C484015 │
│CFM09188.TUM   │               │            4│3D9.1C2D12DD60 │
│CFM09188.TUM   │               │            4│3D9.1C2D10ABD4 │
│CFM09188.TUM   │               │            4│3D9.1C2D0D9197 │
│CFM09188.TUM   │               │            4│3D9.1C2D04E2CE │

Next I submitted the poison file via email to sockeye and watched the Ingres errlog.log...

It loaded with no errors:

Your TAGGING file CFM09188.TUM, which has been submitted 5 times, has been loaded to the ptagis3 database, updating content of the previous submittal.

Details follow:

 For this file the database now contains:
   3067 Records in the tag_data table
   61 Records in the mort_data table
   156 Records in the recap_data table
   3 Records in the tag_dot_out_data table

Next I manually inserted orphan records into tag_data on sockeye for the two orphans.

Then I resubmitted the poison file.

The file loaded with no errors.

Replicate state at application level

I worked with Dave to determine the list of all tags referenced in the poison tagging file. Given that list, Dave created an interrogation file that generated obs_data records for them all. Dave also made sure that we had the same orphans and dupes in sockeye that we had on blueback.

Dave then loaded each of the tagging files from TUM in order up to the poison file. Then we loaded the poison file and it did not crash sockeye.

Moving on

At this point, we will not spend further energy trying to replicate the problem on sockeye. I'll proceed by installing the latest SP and patches on sockeye and then on blueback. Then we'll try loading the poison file on production and hope that it does not crash.

See Plan for installing Ingres 2006 Release 2 SP2 on sockeye

Resolution

Ingres 2006 Release 2 SP2 and the latest patches were installed on blueback and we loaded the poison file to production without any adverse effects. Since the problem was intermittent, we are not certain that we have fixed the problem.

Dave modified /home/pittag/bin/FDVL/fdq_storeAggregationResults.pl to provide more logging if the problem should ever recur. The errors are written to MLOG at /tmp/fdq_storeAggregationResults.log.$$.

Personal tools