Splitting an existing replicat (or extract) with the RANGE command into multiple replicats (extracts) (Doc ID 1320133.1)

Goal

Replicating a high volume of change data in a single data stream (one extract, one pump, one replicat, for example) often does not provide enough throughput to maintain latency requirements for real-time data feeds between source and target systems. Understanding how to split the data into evenly distributed data streams is fundamental to increasing throughput.

 

After a reasonable sample of change data has been captured and written to a trail, the trail data will be used as the primary input to methodically spread the tables across multiple replication processes for increased throughput. This is done by extracting change data over a representative period of time that might either reflect a specific load test or typical production transaction mix.

 

Once the trail(s) exist, a feature of the logdump utility is used to log table details (bytes, inserts, updates, deletes, etc.) to a loosely structured text file. Before this data is used in a spreadsheet, the logdump output is first reformatted into horizontal records, one row per table, with values separated by commas. Reformatting is done with standard UNIX commands strung together at the command line or in shell scripts.

 

Finally, the formatted data is imported into MS Excel (or any spreadsheet application) and percentages are calculated based on change data bytes. The resulting spreadsheet will then provide the insight needed to evenly distribute the data load across multiple processes. The spreadsheet may also double as a summary report to customers.

 

Undoubtedly some tables will account for a greater percentage of the change data than others. When, for example, one table accounts for 90% of all the change data bytes, the load-balancing method described here will highlight that fact, but balancing the load across multiple processes will typically require additional methods such as the use of the RANGE function in TABLE or MAP statements. Ranges are not covered in this document.

 

Fix


Logdump

 

Using logdump to generate a detail report on trail data is straightforward, but using the data in Excel requires reformatting it, which is done here with UNIX shell scripts. First we will look at generating the detail report with logdump manually, and then at doing the same via a shell script.

 

Running Logdump Manually

Like all GoldenGate core product executables starting with version 9.5, logdump must be run directly from the installation directory or the associated libraries will not be found, regardless of the environment's library path settings.

 

 

 

1. Change directories to the GoldenGate home (here identified by the environment variable GGS_HOME but this can be substituted by the actual path name).

 

 

 

 

cd $GGS_HOME

or

cd <full path of GG installation>

 

2. Start logdump

 

 

./logdump

 

3. Log the output to a text file

 

 

Logdump > log logdump_output01.txt

-- Session log logdump_output01.txt opened 2009/05/27 11:28:29.540.030 --

 

4. Issue a detailed count on one or more trail files. (Wild cards and some regular expressions may be used.)

 

 

Logdump > detail on

Logdump > count ./dirdat/1a00*

or

Logdump > count detail ./dirdat/1a00*

 

Output will be echoed to the screen and saved in the file named by the log command. If the file does not exist, logdump creates it. If the file already exists, logdump appends to it, so be careful to rename or remove the file prior to subsequent logdump runs; otherwise the files will need to be edited manually to remove redundant data.
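Because logdump appends to an existing session log, a small guard before each run keeps sessions from mixing. A sketch (the file name is just the example used above, and the `touch` only simulates a leftover log):

```shell
LOG=logdump_output01.txt
touch "$LOG"            # simulate a log left behind by a previous session
# move the old log aside so the next logdump run starts with a clean file
[ -f "$LOG" ] && mv "$LOG" "$LOG.bak"
ls "$LOG.bak"
```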

 

5. Exit logdump

 

 

Logdump > exit

 

 

Running Logdump via a Script

The steps in the preceding section can also be scripted as shown below (see Attachment).

 

#!/bin/sh

LOGDUMP_OUTPUT_FILE=logdump_output_$$.txt

GGS_HOME=/home/ggconsul/ggs/ggs103

TRAIL_NAME=$GGS_HOME/dirdat/1a*

cd $GGS_HOME

./logdump << EOF > $LOGDUMP_OUTPUT_FILE

detail on

count $TRAIL_NAME

exit

EOF

echo "Logdump output can be found in the file: $LOGDUMP_OUTPUT_FILE"

 

The output log file will look similar to:

 

Logdump 2243 >count dirdat/e1a/*00*

Current LogTrail is /apps/ggsadm/ggsm1/dirdat/e1a/1a000000

LogTrail /apps/ggsadm/ggsm1/dirdat/e1a/1a000000 has 249072 records

LogTrail /apps/ggsadm/ggsm1/dirdat/e1a/1a000000 closed

. . .

LogTrail /apps/ggsadm/ggsm1/dirdat/e1a/*00* has 1997810 records

 

Total Data Bytes 2944052112

Avg Bytes/Record 1473

Delete 3811

Insert 1972364

FieldComp 20873

GGSPKUpdate 754

Others 8

Before Images 3811

After Images 1993991

 

Average of 12330 Transactions

Bytes/Trans ..... 7777

Records/Trans ... 162

Files/Trans ..... 0

 

*FileHeader* Partition 0

Total Data Bytes 9321

Avg Bytes/Record 1165

Others 8

 

XXCCS_GGT_O.GGS_HEARTBEAT Partition 4

Total Data Bytes 117001

Avg Bytes/Record 154

FieldComp 5

GGSPKUpdate 754

After Images 759

 

XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H Partition 4

Total Data Bytes 1504432167

Avg Bytes/Record 1349

Delete 3811

Insert 1090023

FieldComp 20868

Before Images 3811

After Images 1110891

 

Formatting with UNIX

 

Reformatting Logdump Data: Quick Minimalist Method

Covered in this section is how to get the minimum amount of data needed to load balance: the table owner combined with the table name, and the associated bytes. Load balancing is always done based on bytes.

 

Thus we will focus on two key lines in the logdump report.

 

1. The line that contains the string "Partition 4", which is originally an NSK artifact but can be consistently associated with, and only with, the line that contains the owner and table names. This eliminates unwanted data such as the summary data for all tables and, in GG 10+, the trail file header data.

 

2. The line immediately following "Partition 4" will always begin with "Total Data Bytes". If the table shows up in the trail then it is implicit that there will be associated bytes. DDL bytes associated with a table are not included in the report.

 

Quickest with Linux

The simplest way to do this is on Linux, whose grep allows a matching line to be pulled (or "grepped") along with any number of lines above or below it. If the system where logdump runs is not Linux, the text file can still be copied to a Linux machine or VM (virtual machine) for quick formatting.

 

Here we will grep the line containing "Partition 4" plus the immediately following line by using grep's "-A" option on Linux, followed by the number of subsequent lines (1):

 

grep -A 1 "Partition 4" logdump_output_1234.txt

 

This will produce output such as:

 

XXCCS_GGT_O.GGS_HEARTBEAT Partition 4

Total Data Bytes 117001


XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H Partition 4

Total Data Bytes 1504432167


XXCCS_GGT_O.ST_KS_XXCSS_QOT_LINES_ALL Partition 4

Total Data Bytes 1439493623

 

A consequence of the "-A" option is that grep puts a line containing "--" between each set of lines pulled out (the line grepped plus the one after it). These separator lines will be deleted with the UNIX sed (stream editor) command, because that facilitates joining every other line when we next use the UNIX paste command.

 

Once lines are joined, only the first column (owner.tablename) and the seventh column (total bytes for that table) will be printed, separated by a comma. Finally, the output is redirected to a new file that can be imported into Excel or viewed for a quick eyeball comparison (the "I Don't Care About Excel" section below shows how to sort the output by bytes).

 

All this can be done on one line using UNIX's ability to "pipe" (|) the output of one command into the input of the next, allowing several commands to be strung together. The following should be typed on one line followed by a return.

 

grep -h -A 1 "Partition 4" logdump_output_1234.txt | sed '/^--/d' | paste - - | awk '{print $1 "," $7}' > logdump_output_simple.csv

 

The output values will be separated by commas, so the result can be imported into Excel as a CSV (comma separated values) file. It will look something like:

 

XXCCS_GGT_O.GGS_HEARTBEAT,117001

XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H,1504432167

XXCCS_GGT_O.ST_KS_XXCSS_QOT_LINES_ALL,1439493623
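The whole pipeline can be verified on a small sample before running it against real logdump output. A sketch (the sample file below is hand-built from the table names shown earlier in this document):

```shell
# build a tiny sample of logdump "count detail" output
cat > sample_logdump.txt <<'EOF'
XXCCS_GGT_O.GGS_HEARTBEAT                 Partition 4
Total Data Bytes            117001
Avg Bytes/Record               154
XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H    Partition 4
Total Data Bytes        1504432167
Avg Bytes/Record              1349
EOF
# same pipeline as above: grab pairs of lines, drop grep's "--" separators,
# join every other line, then print owner.table and the byte count
grep -h -A 1 "Partition 4" sample_logdump.txt \
  | sed '/^--/d' \
  | paste - - \
  | awk '{print $1 "," $7}' > logdump_output_simple.csv
cat logdump_output_simple.csv
```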

 

Second Quickest with all UNIX-like Systems

 

If not using Linux (or even if using Linux), the next simplest method to pull out the owner names, table names, and bytes is to use extended grep (egrep) to grab lines matching one of multiple expressions and redirect the output to a temporary file. Next, the file is manually edited as instructed below. Then the UNIX paste and awk commands print only the needed data from the temporary file, redirected to a CSV file that can be imported into a spreadsheet. The following should be typed on one line followed by a return.

 

egrep "Partition 4|^Total Data Bytes" logdump_output_1234.txt > logdump_output_simple.tmp

 

Sample output is:

 

Total Data Bytes 2944052112

Total Data Bytes 9321

XXCCS_GGT_O.GGS_HEARTBEAT Partition 4

Total Data Bytes 117001

XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H Partition 4

Total Data Bytes 1504432167

XXCCS_GGT_O.ST_KS_XXCSS_QOT_LINES_ALL Partition 4

Total Data Bytes 1439493623

 

Notice that there are more lines starting with "Total Data Bytes" than there are table names listed. Delete all lines beginning with "Total Data Bytes" that do not immediately follow a table name (denoted by "Partition 4"), as these contain summary data for all tables and, in GG 10+, trail header data.
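The manual edit can also be automated with awk: keep each "Partition 4" line and only the "Total Data Bytes" line that immediately follows it. A sketch (the sample input below mimics the egrep output shown above):

```shell
# sample egrep output, including summary "Total Data Bytes" lines to discard
cat > logdump_egrep.tmp <<'EOF'
Total Data Bytes        2944052112
Total Data Bytes              9321
XXCCS_GGT_O.GGS_HEARTBEAT                 Partition 4
Total Data Bytes            117001
XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H    Partition 4
Total Data Bytes        1504432167
EOF
# keep a "Partition 4" line, and a "Total Data Bytes" line only when it
# directly follows one; everything else resets the flag
awk '/Partition 4/ { print; keep = 1; next }
     /^Total Data Bytes/ && keep == 1 { print; keep = 0; next }
     { keep = 0 }' logdump_egrep.tmp > logdump_output_simple.tmp
cat logdump_output_simple.tmp
```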

 

The manually altered file will look like:

 

XXCCS_GGT_O.GGS_HEARTBEAT Partition 4

Total Data Bytes 117001

XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H Partition 4

Total Data Bytes 1504432167

XXCCS_GGT_O.ST_KS_XXCSS_QOT_LINES_ALL Partition 4

Total Data Bytes 1439493623

 

In the final step to prepare the text file for input into Excel, join every other line with the UNIX paste command and then print (awk) only the first and seventh columns, separated by a comma. The following should be typed on one line followed by a return.

 

cat logdump_output_simple.tmp | paste - - | awk '{print $1 "," $7}' > logdump_output_simple.csv

 

The output will be:

 

XXCCS_GGT_O.GGS_HEARTBEAT,117001

XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H,1504432167

XXCCS_GGT_O.ST_KS_XXCSS_QOT_LINES_ALL,1439493623

 

I Don’t Care About Excel; I Just Want a Quick, Sorted Look

 

If a quick look is all that is needed to identify the heaviest tables, the last part of the previous example (after the manual file edit) can be rewritten to sort in descending order based on bytes as follows (type on one line followed by a return):

 

cat logdump_output_simple.tmp | paste - - | awk '{print $7 "\t" $1}' | sort -nr

 

With output as:

 

1504432167 XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H

1439493623 XXCCS_GGT_O.ST_KS_XXCSS_QOT_LINES_ALL

117001 XXCCS_GGT_O.GGS_HEARTBEAT

 

In the above example the seventh column, containing the bytes, is written first, followed by a tab ("\t") for readability, then the owner.table name. The -n option tells sort to sort numerically instead of the default alphabetical order, while the -r option reverses the order to descending so that the heaviest tables appear at the top of the list.

 

Reformatting Logdump Data: Complete Method

The previous section covered only what is needed to load balance, namely the owner.table names and associated byte counts. This section is more script intensive, as it covers formatting all reported details for each table. These details can be used to generate a more complete spreadsheet report for the customer and/or consultant. The possible details for each table include the following (always in this order):

 

Total Data Bytes

Number of data bytes each table took up in the trail (minus header and meta data?)

 

Insert

Number of inserts

 

Delete

Number of deletes

 

FieldComp

Number of non-primary key updates

 

GGSPKUpdate

Number of primary key updates

 

Before

Before bytes

 

After

After bytes

 

 

Looking again at a portion of the sample logdump output we see:

 

XXCCS_GGT_O.GGS_HEARTBEAT Partition 4

Total Data Bytes 117001

Avg Bytes/Record 154

FieldComp 5

GGSPKUpdate 754

After Images 759

XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H Partition 4

Total Data Bytes 1504432167

Avg Bytes/Record 1349

Delete 3811

Insert 1090023

FieldComp 20868

Before Images 3811

After Images 1110891

 

This data is parsed using the script parse_count_detail_vXY.sh, which is provided in Appendix A. The pseudo-logic and usage are briefly described below.

 

1. Print column header information

 

2. Read each line of input and put each element (separated by, but not including, whitespace) into an array

 

3. If the second array element is "Partition" and the third is "4", flag a new table record as found and split the first element on the period (.) into the owner name and table name

 

4. Read each line of the record and capture the value for each of the following:

Insert

Delete

FieldComp

GGSPKUpdate

Before

After

 

5. If the first element of the line is "After", print all the saved table data to the screen and to a file, reset the counters, and unset the found flag

 

6. Return to step 3 until end of file

 

7. At the end of file print to screen the name of file with the formatted output and exit
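The pseudo-logic above can also be expressed compactly in awk. This is only a sketch of the same record-flush idea (the full ksh version is in Appendix A); the field positions follow the sample logdump output shown earlier:

```shell
# sample "count detail" block taken from the output shown above
cat > count_detail.txt <<'EOF'
XXCCS_GGT_O.GGS_HEARTBEAT                 Partition 4
Total Data Bytes            117001
Avg Bytes/Record               154
FieldComp                        5
GGSPKUpdate                    754
After Images                   759
EOF
# reset counters on "Partition 4", flush one CSV row on "After"
awk '
$2 == "Partition" && $3 == 4 {
  split($1, n, "."); owner = n[1]; tab = n[2]
  ins = del = upd = pk = bytes = bef = aft = 0
  found = 1; next
}
found {
  if      ($1 == "Total")       bytes = $4
  else if ($1 == "Insert")      ins   = $2
  else if ($1 == "Delete")      del   = $2
  else if ($1 == "FieldComp")   upd   = $2
  else if ($1 == "GGSPKUpdate") pk    = $2
  else if ($1 == "Before")      bef   = $3
  else if ($1 == "After") {
    aft = $3
    printf "%s,%s,%s,%s,%s,%s,%s,%s,%s\n", owner, tab, bytes, ins, del, upd, pk, aft, bef
    found = 0
  }
}' count_detail.txt > table_stats.csv
cat table_stats.csv
```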

 

Basic usage of the parse_count_detail_vXY.sh script consists of passing in the file name that contains the logdump count detail output.

 

./parse_count_detail_v01.sh logdump_output_1234.txt

 

Output is:

 

Ownername,Tablename,Bytes,Insert,Delete,Update,PKUpdate,After,Before,%ofTotal,RunTotal

XXCCS_GGT_O,GGS_HEARTBEAT,117001,0,0,5,754,759,0

XXCCS_GGT_O,ST_KS_CSI_ITEM_INSTANCES_H,1504432167,1090023,3811,20868,0,1110891,3811

XXCCS_GGT_O,ST_KS_XXCSS_QOT_LINES_ALL,1439493623,882341,0,0,0,882341,0

*****************************************

Excel friendly, comma delimited can be found in: count_detail_output_21768.csv

*****************************************

The output file contains all but the last three lines of what was printed to the screen and can be imported to Excel directly without modification.

 

Using Formatted Data in Excel

 

 

The following can be done with any spreadsheet application but Excel is used throughout this section because of its ubiquity.

 

Importing Formatted Data into Excel

Once the output has been formatted, saved and brought to a machine with Excel installed it can be imported using the subsequent steps.

 

Appendix A

 

Complete Script: parse_count_detail_v01.sh (See Attachment)

The following Korn shell (ksh) script parses the output from a count detail done with the logdump utility and produces comma separated value (CSV) lists in a file. This file contains all table details, with each table name and its associated data on a separate line.

 

The text of this script is further below.

 

If there is trouble running this script, it is probably due to one of two reasons.

 

1. The script needs to be made executable. To do this issue the following command:

 

chmod 754 parse_count_detail_v01.sh

 

2. The script has hidden DOS carriage returns (they look like ^M at the end of each line). First check whether there are ^M characters with:

 

cat -tv parse_count_detail_v01.sh

 

To get rid of these hidden characters issue either of the following commands:

 

dos2unix parse_count_detail_v01.sh parse_count_detail_v01.sh

 

or if dos2unix is not installed:

 

sed 's/.$//' parse_count_detail_v01.sh
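Another portable option is tr, which deletes the carriage returns wherever they sit in the line. A sketch (the demo file below stands in for the script; the output file name is just an example):

```shell
# make a demo file with DOS line endings, then strip the carriage returns
printf 'echo hello\r\necho world\r\n' > dosfile.sh
tr -d '\r' < dosfile.sh > unixfile.sh
head -1 unixfile.sh
```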

________________________________________________________________________

 

#!/bin/ksh
# ********************
# Created by: debuzna, GoldenGate Software 2009
# Usage: <this_file> <output from logdump "count detail">
# Description: This script will parse the unstructured "count detail" data
# from logdump and output each table data record per row in a comma
# delimited format.
# For more information see the GoldenGate Best Practice document:
# Using the Goldengate Logdump Utility to Manually Load Balance
# ********************

# ********************
# Function to check if file exists and is readable
# ********************
function checkfile {
  #Debugging: echo "Input file in checkfile is: $1"
  if [ ! -r $1 ]
  then
    echo "Oops... $1 cannot be opened for reading!"
    exit 1
  fi
}

# ********************
# Check for input parameter
# ********************
if [ $# -ne 1 ]
then
  echo "Input file from logdump \"count detail\" is required..."
  echo "Usage: $0 <logdump_count_detail_file>"
  exit 1
fi

# Input file to process
INPUT_FROM_LOGDUMP=$1
# Call function to check if input file is readable
checkfile $INPUT_FROM_LOGDUMP
#Debugging: echo "Input is: $INPUT_FROM_LOGDUMP"

# This is the file to where output will be saved. $$ is current process ID
OUTPUT="count_detail_output_$$.csv"
# Comma delimited header for $csvstmt data
OUTPUT_HEADER="Ownername,Tablename,Bytes,Insert,Delete,Update,PKUpdate,After,Before,%ofTotal,RunTotal"
# First run, echo output header
echo $OUTPUT_HEADER
# Use redirect to overwrite output file if it happens to exist
echo $OUTPUT_HEADER > $OUTPUT

# Time tracking variables for reporting
starttime=`date`
#Debugging: echo "Starting at: $starttime"

# ********************
# Set defaults
# ********************
# firstrun will ensure the header is printed only at output beginning
firstrun=1
# Zero out table stats because not all tables take all operations
COUNT_INSERT=0
COUNT_DELETE=0
COUNT_UPDATE=0
COUNT_PKUPDATE=0
COUNT_BYTES=0
COUNT_BEFORE=0
COUNT_AFTER=0
TOTAL_BYTES=0

# ********************
# Parse the logdump count detail data and output to stdout & output file
# ********************
isfound=0
isdone=0
# Parse each line from $INPUT_FROM_LOGDUMP (below)
while read someline; do
  # In sh the ()'s should convert someline into an array but seems buggy
  # token=(${someline})
  # So instead we'll use the ksh set -A trick here to break string into an array
  set -A DATA ZERO $someline
  # The string "Partition 4" indicates the line starting with owner.table name
  if [ "${DATA[2]}" = "Partition" -a "${DATA[3]}" = "4" -a $isfound -eq 0 ] ; then
    #Debugging: echo "************************* found partition"
    isfound=1
    isdone=0
    # Use another ksh-only trick to split first array element on the "."
    OWNERNAME=${DATA[1]%.*}
    TABLENAME=${DATA[1]#*.}
    #Debugging: echo "OWNERNAME=$OWNERNAME and TABLENAME=$TABLENAME"
  elif [ $isfound -eq 1 ] ; then
    # We have found a table record so parse it
    # Record based on relevant keywords
    case ${DATA[1]} in
      Insert)
        COUNT_INSERT=${DATA[2]}
        ;;
      Delete)
        COUNT_DELETE=${DATA[2]}
        ;;
      FieldComp)
        COUNT_UPDATE=${DATA[2]}
        ;;
      GGSPKUpdate)
        COUNT_PKUPDATE=${DATA[2]}
        ;;
      Total)
        COUNT_BYTES=${DATA[4]}
        TOTAL_BYTES=$(($TOTAL_BYTES + COUNT_BYTES))
        ;;
      Before)
        COUNT_BEFORE=${DATA[3]}
        ;;
      After)
        # Finding "After" means we are at the end of the record. Grab the
        # after count and build the CSV line for this table
        COUNT_AFTER=${DATA[3]}
        csvstmt="$OWNERNAME,$TABLENAME,$COUNT_BYTES,$COUNT_INSERT,$COUNT_DELETE,$COUNT_UPDATE,$COUNT_PKUPDATE,$COUNT_AFTER,$COUNT_BEFORE"
        # Send table data to output file
        echo $csvstmt >> $OUTPUT
        # Send table data to standard output
        echo $csvstmt
        # Reset counter values (TOTAL_BYTES keeps accumulating across tables)
        isfound=0
        isdone=1
        COUNT_INSERT=0
        COUNT_DELETE=0
        COUNT_UPDATE=0
        COUNT_PKUPDATE=0
        COUNT_BYTES=0
        COUNT_BEFORE=0
        COUNT_AFTER=0
        ;;
      *)
        # Do nothing
        #Debugging: echo "Non-candidate row"
        ;;
    esac
  fi
  if [ $isdone -eq 1 ]; then
    isdone=0
  fi
# End while loop that flips through each line in input
done < $INPUT_FROM_LOGDUMP

echo "*****************************************"
echo "Output is also in Excel friendly, CSV file: $OUTPUT"
echo "*****************************************"
# endtime=`date`
# echo "Total bytes: $TOTAL_BYTES"
# echo "Start logdump scripts : $starttime"
# echo "End logdump scripts : $endtime"
# echo "*****************************************"
exit 0

 

________________________________________________________________________

 

Example output:

 

Ownername,Tablename,Bytes,Insert,Delete,Update,PKUpdate,After,Before,%ofTotal,RunTotal

XXCCS_GGT_O,GGS_HEARTBEAT,117001,0,0,5,754,759,0

XXCCS_GGT_O,ST_KS_CSI_ITEM_INSTANCES_H,1504432167,1090023,3811,20868,0,1110891,3811

XXCCS_GGT_O,ST_KS_XXCSS_QOT_LINES_ALL,1439493623,882341,0,0,0,882341,0

The last two column headers have no corresponding data as they will be populated using Excel functions once imported into a spreadsheet.
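If Excel is not handy, the last two columns can also be filled in with a two-pass awk over the CSV, summing the Bytes column first and then appending each table's percentage of total bytes plus a running total. A sketch (the sample file below reuses the rows shown above; the output file name is just an example):

```shell
# sample of the script's CSV output
cat > count_detail_output.csv <<'EOF'
Ownername,Tablename,Bytes,Insert,Delete,Update,PKUpdate,After,Before,%ofTotal,RunTotal
XXCCS_GGT_O,GGS_HEARTBEAT,117001,0,0,5,754,759,0
XXCCS_GGT_O,ST_KS_CSI_ITEM_INSTANCES_H,1504432167,1090023,3811,20868,0,1110891,3811
XXCCS_GGT_O,ST_KS_XXCSS_QOT_LINES_ALL,1439493623,882341,0,0,0,882341,0
EOF
# pass 1 (NR==FNR) sums the Bytes column; pass 2 appends %ofTotal and RunTotal
awk -F, '
NR == FNR { if (FNR > 1) total += $3; next }
FNR == 1  { print; next }
{
  run += $3
  printf "%s,%.2f,%d\n", $0, 100 * $3 / total, run
}' count_detail_output.csv count_detail_output.csv > count_with_pct.csv
cat count_with_pct.csv
```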


 
