Custom BCP Queries

Overviews

Custom BCP Queries

db_snapshot & db_restore - Custom BCP Queries

Custom BCP Queries consists of two modifications:

Multiple Snapshot Files for One Table
BCP Query Configuration File

About Multiple Files for a Single Table

Custom BCP Queries are user-defined SQL queries used to produce partial datasets consisting of selected LeasePak data. This capability lends itself to custom extracts, as well as to the incremental backup of certain large tables, such as the rtx table. Because the pre-v6.6a db_snapshot command selects all rows from the tables it exports, this results a set of data files containing every row in the subject table.

With Custom Queries, it is possible that there will be multiple segments for a given table. In Oracle only, the export of data for a snapshot is actually performed in approximately 2,000,000,000 (why is this?) byte segments. Previously, these segments were concatenated into a single file at the end of the export of the table. In LeasePak release version v6.6a, db_snapshot running on Oracle will preserve these segmented files. The db_restore command has also changed to look for and correctly load multiple segments for a single table under Orcale.

Why Is This? "Approximately 2,000,000,000 bytes"

This is because there is an OS limit on the size of files. This can be overridden, but it is not convenient to do so, and can cause the system administrator a variety of headaches. The normal limit on maximum file size is 2GB (2,147,483,647 bytes). The utility that exports Oracle database rows checks the size of the output file after every row, and when it is over 2,000,000,000 bytes, the output file is closed and a new one opened. This means that no row in the database should exceed 147,483,647 bytes in export format, or else the write could fail as the OS will not allow bytes beyond the 2GB limit. The final row of each segment file will likely push the file size above 2 billion bytes, but under 2GB.

That's Why!

db_snapshot behavior - summary

When the snapshot process creates separate segments for the same table, it no longer concatenates these segments into one file.
When multiple segments for the same table are created, the name of the segment will have a period and a number added to the end of the file extension, and that number will be incremented by 1 for each additional segment.

db_snapshot behavior - details

A large table being exported through db_snapshot in an Oracle database will create multiple segments with a number added to the end which is incremented by 1 for each additional segment created. No individual file will be greater than 2GB in length. The end of the db_snapshot script’s processing for each table used to concatenate these multiple segments into one file containing the given table. db_snapshot no longer performs this concatenation under Oracle.

Multiple segments for the same table are named with an additional suffix made up of a period (.) and a number which starts at 1 and increases by 1 for each new segment. Therefore if the rtx table generates three segments, they will have the following names: rtx.bcp.1, rtx.bcp.2, rtx.bcp.3.

db_snapshot Output file names

File Extension	When produced
`.bcp`	produced by non-q snapshots — all Sybase tables; Oracle tables < 2-billion bytes
`.bcp.n`	produced by non-q snapshots from Oracle of very large tables also when `.pbcp` segments are renamed to `.bcp` in preparation for loading by db_restore
`.pbcp`	produced by query (-q) type snapshots of smaller Oracle tables
`.pbcp.n`	produced by query (-q) type snapshots of large Oracle tables
`.txt`	lastsnapshot.txt - created or expanded on each snapshot
`.txt`	terminators.txt - the set of record and field terminators used to delimit rows and columns of the current snapshot

No changes were made to db_snapshot running under Sybase. Sybase does not segment exported data files.

db_restore behavior - summary

Recognizes when multiple segments exist for the same table.

Under Oracle

When multiple segments exist for a table, all of these segments are loaded into that one table during the restore.
Under Sybase

When mulitple segments exist for a table, all of these segments are concatenated, and the multiple numbered segments are removed, then the concatenated file is loaded into the LLDB.

db_restore behavior - detail

The db_restore process recognizes when multiple segments exist for the same table. For instance, the files for rtp could exist as rtp.bcp.1, rtp.bcp.2, and rtp.bcp.3. This is still be recognized as the data for rtp. When multiple segments exist for one table, the restore process restores all segments into the one table.

As long as the name of each segment follows the naming convention given it will be restored. All LeasePak records are unique, so the sequence in which they are restored is immaterial. The –p <count> option governs how many processes can run concurrently to load data into the tables (at any given moment, where each process can be loading one segment, up to a maximum of 15 processes. Different segments for the same table can be loaded concurrently.

Because of special circumstances when doing a db_restore in an Oracle database using either -t or -x options, the process will look specifically for table.bcp.1 to determine if multiple segments exist. In a standard Oracle snapshot, table.bcp.1 will always exist if the table is segmented. In the custom query snapshots, the person creating that snapshot must make sure that table.bcp.1 is present for "db_restore -x" or "db_restore -t" to be successful.

In order to restore multiple segments to a Sybase database, the segments must be concatenated into one file first. The db_restore process under Sybase will concatenate all segments and replace the segments with the concatenated file in the data-set.

If a dataset with segmented files that is being restored under Sybase, and preserving the segmented files is desired, the operator must make a copy of the data-set before restoring.

If the user is constructing a data-set that contains only parts of tables, the possibility exists that subsequent bcp.# files may contain rows with keys identical to rows in other bcp.# files or already existing in the LLDB. If such duplicate keys are found during the db_restore process, the restore reports an error, leaving the table in an undefined condition, requiring that the table be truncated and all of the data segments be reloaded, after the duplicate row issue(s) have been resolved.

BCP Query Configuration File

A requirement for implementing Custom BCP Queries is where to store the custom-coded SQL source code. To address this need, the BCP Query Configuration File was created. This file, described below, is prepared by the site IT staff, and is introduced to the db_snapshot process via the -q queryfile:query-id parameter added to the db_snapshot command.

The configuration file for the query must be located in the $udata directory and it must have the .bcpqry extension.
When using the -q flag, no previously created snapshot files will be deleted.
When using the -q flag, the script will first check that there are no files existing in the output directory which begin with the filename and date being used and will stop with an error message if one is found.
The configuration file must define:
- ID for a given select query
- Name to use for the output file name
- SQL select statement

A flag option has been added to the db_snapshot script. The usage definition now is:

db_snapshot environment-name dataset [-p process-count]
[-t tables | -x tables | -q config_file:query_id] [dbo-password]

Use of the new optional parameter -q will then require that a configuration file be indicated along with the ID of the query within the configuration file to execute. The -q will use the dataset parameter to indicate where the output files will be located as it does today. However, the dataset directory will not be cleared when running with the -q as it is otherwise. Only one of the -t, -x, or -q flags can be used at in a single db_snapshot command. The -p is irrelevant with the -q because multiprocessing is not possible and therefore it will be ignored if included.

The configuration file must be placed in the environment's $udata directory. The configuration file must have bcpqry as its extension. Enter the configuration file name without the .bcpqry extension, after the -q flag. Immediately following the configuration file name in the db_snapshot command there must be a colon (:) followed by the ID of the query to be executed. For example, if the configuration file is named foo.bcpqry and the ID of the query to be executed is '29', the db_snapshot command will have -q foo:29 as parameters.

The configuration file contains from 1 to 50 query-blocks; each query-block has this structure:

<ID> query-id
- ID of the query.
IDs do not need to be in any particular order. If an ID is repeated in a subsequent query block, the first query block with a matching ID found will be used.
<FILENAME> filename base
- base name to use for the output file name.
The file name may be a table name but will not be validated as an actual valid table in the database and will not be verified to match a table in the query either. The file name cannot be identical to another file name run in parallel if it will be placed in the same dataset. In this case an additional component may be needed such as in rls.1, rls.2, rls.n, etc.
<SQL> sql-query
- SQL query beginning with "select .
Long queries may wrap over several lines, as long as the <SQL> keyword and the start of the query "SELECT are on the same line, and the constraint of line length (see above) is not violated. The query must be surrounded by double quotes (") and any nested quotations within the query must alternate between single quotes (') and double quotes ("). The quotation mark characters must be ASCII 0x27 (') (') and 0x22 (") ("). Smart quotes will be misinterpreted and cause an error.
The configuration file may also include comments by preceding them with a !, #, or ;. This can only be done on a separate line and cannot be added to the end of a line in the configuration file. See example below.
Each query-block is terminated by the keyword <END>. This keyword also serves to terminate the SQL statement. The keyword <END> must occur as the first non-white characters on the line on which it occurs. Any text on that line after the <END> keyword is a syntax error.
Conventionally, there is at least one line of whitespace between the <END> of one query block and the <ID> of the following block. This is optional, however. Comments and lines of whitespace are ignored by the parser used by db_restore.

These elements must be in the above order. They can be repeated in this order up to 49 times. The total text of an individual query block, <ID> through <END>, must be < 2048 characters. Each line must be < 200 words. A word is defined as a string of characters separated by whitespace at the beginning and end. Each word must be < 30 characters.

Whitespace consists of any unquoted occurrence of one or more characters from the set of space, horizontal tab, vertical tab, formfeed, newline, carriage return (0x20, 0x09, 0x0B, 0x0C, 0x0A, 0x0D).

Here is an example of a possible configuration file:

	<ID> base 
	<FILENAME> rtx 
	<SQL> "select * from rtx" 
	<END> 
	
	<ID> delta 
	<FILENAME> rtx 
	<SQL> "select rtx.* from rtx, ras 
	where rtx.por_s = ras.por_s 
	and (rtx.d_run_s > ras.d_prev_cmplt_day_s 
	or (rtx.d_run_s = ras.d_prev_cmplt_day_s 
	and rtx.t_run_s > ras.t_prev_cmplt_day_s))" 
	<END> 
	
	<ID> rlsgroup1 
	<FILENAME> rls.1 
	<SQL> "select * from rls where por_s <= '50'" 
	<END> 
	
	<ID> rlsgroup2 
	<FILENAME> rls.2 
	<SQL> "select * from rls where por_s > '50'" 
	<END> 
	
	<ID> rlsagroup1 
	<FILENAME> rlsa.1 
	<SQL> "select rlsa.* from rlsa, rls 
	where rlsa.lse_s = rls.lse_s and rls.por_s <=; '50'" 
	<END>

	<ID> rlsagroup2 
	<FILENAME> rlsa.2 
	<SQL> "select rlsa.* from rlsa, rls 
	where rlsa.lse_s = rls.lse_s and rls.por_s > '50'" 
	<END>

The output file(s) created by db_snapshot -q will be located in the dataset sub-directory (see About Datasets specified in the db_snapshot command. The files will be named using the structure of filename.yyyymmdd.pbcp.# where filename is the value for <FILENAME> in the configuration file, yyyymmdd is the current date at the beginning of the script execution, and # is the sequence number. The sequence number will begin at 1 and be incremented by 1 for each subsequent segment created, but no number will be added if only one file is created. No segment can be greater than 2GB and, therefore, larger output will be segmented into separate files using the sequence number to create unique names. For example rtx.20130329.pbcp will be the first file created when the filename is rtx and it is run on March 29, 2013. If multiple files need to be created due to more than 2GB the first file will be renamed using the sequence number "1".

This description of when a new file is added to a large table export, is approximate. We can say that each segment will end immediately after the record terminator of the record whose contents pushed the file over the 2-billion byte limit. Since it is very unlikely that a single row in any table will exceed the 147MB of padding available between 2 billion and 2 GB, we can safely say that no export file will exceed 2GB. See About Datasets.

When running db_snapshot -q, the script will first check the designated dataset directory to determine if any files exist which begin with the same filename and date as this run intends to use. In the example of a file named rtx.20130329.pbcp.1, the filename is rtx and the date is 20130329, so db_snapshot will check for any files named rtx.20130329.pbcp* before executing. If a file beginning with this name is found, db_snapshot will abort, and no further snapshot files will be created. Otherwise the processing will continue as described.