Fdvl
From PTAGISWiki
Contents |
Overview
FDVL is the Field Data Validation and Load process of PTAGIS. This process is responsible for loading Tag file information into the PTAGIS database.
Field Data Validation and Load
Field Data Validation and Loading
DESIGN NOTES
Doug Clough, SYNERGETICS Engineered Systems
Version 1.2, 1 Mar 99
Introduction
Summarized below are design requirements and strategies for re-work of the Field Data Loader, and ancillary processes, now in progress. An overview of the new system is presented in Fig. 1. As details are developed, they will be described below and illustrated by additional figures, where this would help clarify important concepts. A development "To Do List" is maintained at the end of the document.
General Requirements
As a guiding principle, field data files are "the truth." Data stored in tables of the PTAGIS3 database merely represent "the truth," and must be in demonstrable agreement with contents of the corresponding field data files.Because errors may be made in expressing "the truth," the file with a given name might be submitted on multiple occasions for loading and re-loading into the database, until all errors, omissions, and inconsistencies have been corrected.
File Archival and Retrieval: Loading Attempts
Let us identify loading attempts by number, beginning with 1.For each file, the system must insert one record into the fd_file table, as illustrated in Fig. 2. Each record contains:
- file name
Column name is fd_file_id
- submittal date and time
Updated each time a version of the file is submitted
- archival directory specification
All versions of an input file will be stored in the same directory.
Directory spec will be composed of three elements, each stored
in a separate column:
Element Column Remarks
------------------------------------------------------------------------------
Type-specific path type_path Name of environment variable
which contains type-specific
path. Eg. for a tagging file,
type_path = TAG_ARCH, and this
might translate to:
/home/pittag/data/tagging/loaded
Year-specific directory year_dir 4-digit representation of year
Domain-specific dir domain_dir 2- or 3-character "domain identifier"
NOTE: These elements provide information necessary to locate any field
data file, at any point in time. When a file is first processed,
FD_Validator will insert a row into the fd_file table, setting
type_path in accord with the file type, and setting year_dir based
on the current date.
The "domain identifier" will be determined in the following manner,
depending on file type:
File Type Domain ID / Found in ...
------------------- --------------------------------------------
Tagging "Domain" identified by initials of tagging
coordinator. This appears as (a) the leading
alphabetic characters of the external file
name ("tagging supervisor") or as (b) content
of the COORDINATOR ID record within the file.
** Are (a) and (b) the same in this context? **
Release Information "Domain" identified by initials of tagging
coordinator. These appear as characters "XXX"
of the external file name, which has the
format "RELYYXXX.ZZZ"
Interrogation "Domain" identified by interrogation site
code. This appears as the three leading alpha
characters of the external file name. Note
that these files are loaded by the "IDL" process.
FD_Validator does not see these files. Thus,
** fd_file will contain no entries for Interrogation files **
Monitored Release "Domain" identified by monitored release site
code. This appears as the three leading alpha
characters of the external file name.
Mortality "Domain" identified by initials of tagging
coordinator. This appears as content of the
COORDINATOR ID record within the file.
** Domain ID cannot be determined from external file name **
On any given loading attempt,
FD_Loader will consult the fd_file table, then place the current
version of the file as described below:
Current Date Path for Processed File
------------------------------------------------------------------------
( Year ID = year_dir ) type_path/domain_dir/file_name:L_nn
( Year ID > year_dir ) type_path/year_dir/domain_dir/file_name:L_nn
where
L_nn is as described below
- loading-attempt sequence number
Incremented each time a version of the file is submitted
- "status" of the loading attempt
This value must be updated as the file moves through the process:
RECEIVED, STAGED, ..., CompletionStatus <--( See below )
- date and time of "status"
Updated each time "status" changes
If all field data files loaded into the database were correct, it ought to be possible to re-submit any collection of files any number of times without generating error messages or changing the contents of the database -- except for changes in time-stamps and status indicators as indicated above. Conversely, inserts, updates, and deletes in the database must correspond exactly to valid changes in the associated field-data files as seen in successive load attempts.
For audit, and for recovery from system failure or human error, the file processed in each loading attempt must be saved on disk. To avoid over-writing the results of previous attempts, the attempt-number must be appended to the file name before the file is moved to the archival directory. Thus, archived files must be named as follows:
archival_directory/data_file_name:L_attempt_number
where
data_file_name conforms to PTAGIS Spec Doc
Completion Status: Processing of Duplicate Records
Each loading attempt must have a CompletionStatus as follows:
REJ_SPEC - File structure or content did not conform to PTAGIS
Specification Document. No attempt was made to load
file content into the database. However, diagnostic
messages were provided to facilitate corrections.
REJ_DUPS - One or more PIT-tag id's contained in the file were
identified as:
- Appearing more than one time in the same context
in this file (e.g. more than one occurrence of
the tag-id in the context of "tagging
information," or in the context of "release
information"). Call this SELF_DUP
- Appearing in another file in the same context
as in this file (e.g. another file already
incorporates the tag-id in the context
of "tagging information," or in the context
of "release information"). Call this OTHR_DUP
OTHR_DUP tags and SELF_DUP tags were identified in
a diagnostic message. No attempt was made to load
file content into the database.
NET_DUPS - SELF_DUP "mort" tags were identified in a diagnostic
message. The first occurrence of each SELF_DUP "mort"
was loaded to mort_data. Further occurrences were
written to the REJECTS file, to the mort_dup_data table,
or both -- we agreed to postpone a decision on this for now.
Note that duplicates are expected and acceptable in
the following circumstances:
Type of Information Comment
------------------------ -----------------------------
Recapture A given tag-id may appear in
multiple field data files as
a recapture. Barring other
problems, a file with OTHER_DUP
"recaptures" will be processed
and marked with LOADED status.
However, if SELF_DUP "recaptures"
are encountered, processing will
terminate with REJ_DUPS status.
Further, if a given tag-id is
identified as a "recapture" in
the same file where it appears
in the context of an "original
tagging event", processing will
terminate with REJ_DUPS status.
Mortality A given tag-id may appear in
multiple field data files as
a "mort." Barring other problems,
a file with OTHER_DUP "morts"
will be processed and marked with
LOADED status.
Further, considering data collection
methods in use at Rice Island, for
example, a given tag may appear many
times in the same file. Barring other
problems, a file with SELF_DUP "morts"
will be processed and marked with
NET_DUPS status.
LOADED - All rows were loaded into the database.
Audit Process
For a given file name, contents of PTAGIS3 tables must match the "net content" of the most recent NET_DUPS or LOADED attempt. To demonstrate that this requirement is met, an audit mechanism must be built according to the following conceptual design:
Data File --> FD_Validate --> Intermediate File <-- Do they Match?
| |
v +--------^------------+
FD_Load --> REJECTS + AUDIT_FILE
| ^
v |
PTAGIS3 ------------> FD_Audit
Note that the REJECTS file could be generated by running a report against the xxx_dup_data table (e.g. mort_dup_data, tag_dup_data) if we opt to load the duplicates into the database. Alternatively, it would simply reside in the same directory as the corresponding Intermediate File.
The transformation from Data File to Intermediate File may be "designed out" of the system when the structure of field data files is "rationalized" in the future. In the meanwhile, it will be necessary to ensure that FD_Validate doesn't corrupt the information presented to it in the raw Data File. It is anticipated that FD_Audit will be run frequently during development, testing, and early production use of the new system. However, once behavior of the system stabilizes, it is possible that FD_Audit would be run only on a "spot-check" basis, or to help diagnose and resolve specific data-quality issues as they might arise.
Specific Requirements
Enumerated below are processing requirements specific to individual file-types.Tagging Files: Interpretation of Flag Codes
Tagging flag codes may be described as being of two general types:- 1. Data-Bearing
- Convey information about the associated fish
- 2. Context-Setting
- Identify the "context" (e.g. "recapture", "mortality") in which the information was gathered
- MS --> Mortality (Intentional sacrifice)
- SM --> Mortality (Subsequent mort)
- M --> Mortality (Mortality)
- RE --> Recapture
Design Strageties
Validate, load, and archive files in such a manner that the user may deal with any file in its entirety in every loading attempt. If the validation process identifies errors during one loading attempt, the user will make corrections and resubmit the entire file.Similarly, process each file in its entirety on every loading attempt. Flag any invalid data, but perform inserts, updates, and deletes as necessary to bring contents of the target tables into agreement with valid information contained in the file. Employ the following strategy:
To each "header" table (e.g. tag_hdr) add two columns:
Column Purpose and Usage
----------------- -------------------------------------------
1. load_nbr Beginning with 0, and incremented each
time the associated file is processed,
identifies the most recent load attempt.
Corresponds to "L_nn" suffix in file name:
archival_directory/data_file_name:L_attempt_number
2. load_date_time Date and time of most recent load attempt
To each "detail" table (e.g. tag_detail) add one column:
Column Purpose and Usage
----------------- -------------------------------------------
1. load_nbr On each load attempt, every row INSERTED
or UPDATED is marked with the attempt_nbr
value in the corresponding row of the
associated "header" table. When the inserts
and updates have been completed, any
detail row bearing a smaller attempt_nbr
must be DELETED, as it is not represented
in the most recent version of the field
data file. NOTE: Each table will require
a secondary index on load_nbr.
To avoid concurrency problems, continue the practice adopted in design of the Interrogation Data Loader, of keeping transactions as small as possible. This means that files will be "staged" and "loaded" one row at a time. All rows in each input file will be processed, and every file will be processed "as a correction," following the strategy outlined above.
Next Steps
DONE 1. Modify transaction processing strategy incorporated in current
implementation of FD_Validator and FD_Loader to conform with
ideas presented above. (Completed: 1 Feb 99)
DONE 2. Define table to track load attempts of each field data file.
Build mechanisms for manipulating fd_file, and for disposing
processed files to appropriate directory.
(Initial implementation completed: 3 Feb 99)
- Change table definition to incorporate domain_dir
instead of week_dir, per meeting of 25 Feb
- Build tool for moving field data files from current directory
to new directory structure, making appropriate entries in
fd_file table, per meeting of 25 Feb
- Based on new directory structure, prototype web-page interface
allowing easy user access to any field-data file, per meeting
of 25 Feb
3. Implement parsing and validation mechanism driven by meta-data
(Initial implementation tested OK: 11 Feb 99)
(Incorporated into fdv_checkConformity.pl, tested: 16-19 Feb 99)
NOW -> - Design and implement data-structures for mapping file fields
to table columns (In progress: 23 Feb 99)
- Enhance parsing and validation mechanism to dispose data in
accordance with mapping rules
NEXT-> - Enhance Spec Doc meta-data to incorporate precedence rules for
"context-setting" Flag Codes, per meeting of 25 Feb
- Enhance parsing and validation mechanism to act on precedence
rules, per meeting of 25 Feb
4. Implement initial version of FD_Audit.
5. Incorporate "load_nbr" into tables in p3_dev database, and into
FDV and FDL modules.
END
---------|----------|----------|----------|----------|----------|----------|Modification Requests
1. Mort Dates
I was alerted today to a failure of FDVL to assign a 'Mort_Date' value to the ptagis3.mort_data table unless that date is explicitly reported using a Variable Release Time definition. In the absence of a VRT value, FDVL fails to capture the default "Release Date" value reported in the header of the P3 tag/release/recovery data file. This failure appears to be specific to mortality events, and is not evident in our tag_data or recap_data event records.
2. See FDVL_Mods_DISOWN
