Steps to restart Ingest.
Steps to perform a Rollback in Ingest.
Steps to perform a Full Extract when an Initial Sync needs to be performed for all tables configured in a pipeline.
Steps to redo the Initial Extract for existing tables.
Steps to perform deltas without an Initial Sync of the data, or for XL-Ingest.
Steps to add only new tables into existing pipelines.
Steps to perform a Sync Struct for existing tables when the source structure changes.
Steps to stop Ingest.
The BryteFlow setup guide details how to enable supplemental logging on the source table(s) for Delta replication. Please refer to the setup guide:
https://docs.bryteflow.com/bryteflow-setup-guide#prerequisite-for-oracle-source
Important note:
Both database-level and table-level supplemental logging are required.
Steps to create an additional Ingest pipeline.
Latency is calculated as the time from when a record is committed on the source until the record is available on the destination.
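As an illustration, the latency of a single record is simply the difference between its destination-availability timestamp and its source commit timestamp. The function name and the timestamps below are hypothetical, chosen only to show the calculation:

```python
from datetime import datetime, timezone

def record_latency(commit_time: datetime, available_time: datetime) -> float:
    """Return replication latency in seconds for one record."""
    return (available_time - commit_time).total_seconds()

# Hypothetical example: record committed on the source at 13:04:23 UTC,
# visible on the destination at 13:04:53 UTC -> 30 seconds of latency.
committed = datetime(2023, 6, 15, 13, 4, 23, tzinfo=timezone.utc)
available = datetime(2023, 6, 15, 13, 4, 53, tzinfo=timezone.utc)
print(record_latency(committed, available))  # 30.0
```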
S3 Delta files contain the incremental data for a particular run from the source database.
The file has 2 types of fields:
1. seq_no – This field determines the order of each record. The value is a sequence number, incremented by 1 for each record within a file. For incremental files, the value determines the order of DML operations performed on the source.
2. eff_dt – This field is present in the files for tables whose Delta replication is set to 'Primary Key with History'. It determines the commit time of the record on the source. Each record carries its commit time as 'EFF_DT' in the delta files.
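A minimal sketch of how a downstream consumer might use these fields, replaying delta records in source commit order. The in-memory records and the COLUMN1 payload field below are fabricated for illustration; real delta files are Parquet and carry the table's actual columns alongside seq_no and eff_dt:

```python
# Hypothetical delta records; seq_no gives the DML order within the file,
# eff_dt the commit time on the source.
delta_records = [
    {"seq_no": 2, "eff_dt": "2023-06-15 13:04:25", "COLUMN1": "b"},
    {"seq_no": 1, "eff_dt": "2023-06-15 13:04:23", "COLUMN1": "a"},
    {"seq_no": 3, "eff_dt": "2023-06-15 13:04:30", "COLUMN1": "c"},
]

# Replay the DML operations in source order by sorting on seq_no.
ordered = sorted(delta_records, key=lambda r: r["seq_no"])
for rec in ordered:
    print(rec["seq_no"], rec["eff_dt"], rec["COLUMN1"])
```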
BryteFlow Ingest delivers data to the S3 data lake from relational databases such as SAP, Oracle, SQL Server, Postgres, and MySQL, either in real time or as changed data in batches (as per configuration), using log-based CDC.
The upsert on the S3 data lake is automated and requires no coding or integration with any third-party application.
It prepares a manifest file on Amazon S3 with the details of the files loaded onto the data bucket.
Location: [data directory path]/manifest/[table_name_{partitionname}].manifest
File Type: manifest
File format: JSON
File Details: The file contains the list of S3 URLs for all the data files created for each table.
Sample entry for the manifest file:
{
  "entries": [
    {"url": "s3://samplebucket/data/SAMPLETABLE_A/20230615130423/part-r-00000.snappy.parquet", "meta": {"content_length": 26267}, "mandatory": true}
  ]
}
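Because the manifest is plain JSON, listing the data-file URLs for a load is straightforward. A sketch using the sample entry above (the parsing code is illustrative, not part of BryteFlow):

```python
import json

# The sample manifest entry from the documentation above.
manifest_text = """
{
  "entries": [
    {"url": "s3://samplebucket/data/SAMPLETABLE_A/20230615130423/part-r-00000.snappy.parquet",
     "meta": {"content_length": 26267},
     "mandatory": true}
  ]
}
"""

manifest = json.loads(manifest_text)
# Collect every data-file URL listed in the manifest.
urls = [entry["url"] for entry in manifest["entries"]]
for entry in manifest["entries"]:
    print(entry["url"], entry["meta"]["content_length"])
```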
BryteFlow Ingest also prepares a .struct file on Amazon S3 with the details of the table structure of the loaded tables.
Location: [data directory path]/manifest/[table_name_{partitionname}].struct
File Type: .struct
File format: JSON
File Details: The file contains the structure of the table, with the column list and data types.
Sample entry for the STRUCT file:
{
  "manifest": "s3://samplebucket/data/manifest/SAMPLETABLE_A.manifest",
  "srcTable": "default:SAMPLETABLE_A",
  "dstFile": "SAMPLETABLE_A",
  "dstCleanFile": "SAMPLETABLE_A",
  "outputFormat": "parquet.snappy",
  "structure": [
    {"name": "COLUMN1", "type": "VARCHAR"},
    {"name": "COLUMN2", "type": "INTEGER"},
    {"name": "COLUMN3", "type": "VARCHAR"},
    {"name": "COLUMN4", "type": "INTEGER"}
  ]
}
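Since the .struct file is also JSON, a downstream job can derive a column/type mapping from it. A sketch against a trimmed version of the sample above; the CREATE TABLE string is purely illustrative (real target DDL depends on the destination engine's type names):

```python
import json

# Trimmed version of the sample .struct entry from the documentation above.
struct_text = """
{
  "srcTable": "default:SAMPLETABLE_A",
  "dstFile": "SAMPLETABLE_A",
  "outputFormat": "parquet.snappy",
  "structure": [
    {"name": "COLUMN1", "type": "VARCHAR"},
    {"name": "COLUMN2", "type": "INTEGER"}
  ]
}
"""

struct = json.loads(struct_text)
# Map each column name to its declared data type.
columns = {col["name"]: col["type"] for col in struct["structure"]}

# Build an illustrative CREATE TABLE statement from the column list.
ddl = "CREATE TABLE {} ({})".format(
    struct["dstFile"],
    ", ".join(f"{name} {dtype}" for name, dtype in columns.items()),
)
print(ddl)
```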
Steps for enabling Windows authentication in BryteFlow Ingest:
1. Stop the BryteFlow Ingest service.
2. Download sqljdbc_auth.dll and copy it to the bin directory of your Java installation. You can download the DLL from https://bryteflow.com/
3. Start the Ingest service and add the following to the JDBC options in the Source database connection settings:
integratedSecurity=true
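For example, a SQL Server JDBC URL with this option appended might look like the following; the host, port, and database name are placeholders, not values from this guide:

```
jdbc:sqlserver://myhost:1433;databaseName=mydb;integratedSecurity=true
```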
BryteFlow Ingest can be set up to Rollover in a parallel pipeline.
Below are the steps to roll over. Please note this can also be implemented using BryteFlow Ingest REST API calls, with details below: