BryteFlow Ingest is real-time data replication software that replicates data from a variety of sources to a choice of destinations. It is high-performance software that facilitates real-time change data capture from sources with zero load on the source systems. BryteFlow Ingest captures the changes and transfers them to the target system. It automates the creation of either an exact copy or a time-series copy of the data source in the target. BryteFlow Ingest performs an initial full load from the source and then incrementally merges changes into the destination of choice; the entire process is fully automated.
BryteFlow Ingest works with its companion products, which are part of the BryteFlow product suite.
BryteFlow Ingest supports the following database sources:
The supported destinations are as follows:
Looking for a different destination?
BryteFlow does custom source/destination on customer request, please contact us directly at info@bryteflow.com.
BryteFlow Ingest can replicate data from any database, any API and any flat file to Amazon S3, Redshift, Snowflake, Databricks, PostgreSQL, Google BigQuery, Apache Kafka and more through a simple point-and-click interface. It is an entirely self-service and automated data replication tool.
BryteFlow offers its customers several deployment options, described below:
BryteFlow Ingest uses log-based Change Data Capture for data replication. Below is the Technical Architecture Diagram, showcasing the same for a standard setup in AWS Environment.
Below is the architecture diagram for BryteFlow Ingest in a standard deployment. It is the reference architecture for the setup instructions provided in this user guide. For more details on setting up any optional components please contact BryteFlow support.
Estimated deployment time: approximately 1 hour
Below is the BryteFlow Ingest architecture for a standard deployment, showcasing integration with various optional AWS services.
The architecture diagram above describes a standard deployment type and showcases the following features:
Estimated deployment time: approximately 1 day
The high availability architecture explains the way BryteFlow is deployed in a multi-AZ setup. In case of any instance or AZ failure it can be auto-scaled in another AZ without incurring any data loss.
Estimated deployment time: approximately 4 hours
BryteFlow also offers a hybrid deployment model, which is a mix of services on-premises and in the AWS Cloud. BryteFlow Ingest can easily be set up on a Windows server in an on-premises environment, while all the destination endpoints reside in the AWS Cloud, making it a hybrid model. It is recommended to use secure connectivity between on-premises and AWS services, which can be achieved with a VPN connection or AWS Direct Connect; refer to the blog that discusses choices for hybrid cloud connectivity.
Prerequisites of using Amazon Machine Image (AMI) from AWS Marketplace
Using the AMI sourced from the AWS Marketplace requires:
The steps to create AWS services are described in detail under the section 'Environment Preparation'.
Follow the steps below prior to launching BryteFlow in AWS via the AMI or a custom install on an EC2 instance:
The AMI options are volume-based; recommended EC2 and EMR configurations for each data volume are listed below.
Total Data Volume | EC2 Recommended | EMR Recommended |
< 100 GB | t2.small | 1 x m4.xlarge master node 2 x c5.xlarge task nodes |
100GB – 300GB | t2.medium | 1 x m4.xlarge master node 2 x c5.xlarge task nodes |
300GB – 1TB | t2.large | 1 x m4.xlarge master node 2 x c5.xlarge task nodes |
> 1TB | Seek expert advice from support@bryteflow.com | Seek expert advice from support@bryteflow.com |
NOTE: Evaluate the EMR configuration depending on the latency required.
These should be considered a starting point; if you have any questions please seek expert advice from support@bryteflow.com
System Requirement when not using Amazon Machine Image (AMI)
The following describes the hardware configuration for a Windows server, assuming a few source and target combinations (ideally three of medium size). Requirements also depend on how intensively data is being replicated from these sources, so this is a guide; extra resources may be needed depending on the amount of data being replicated. The amount of disk space required likewise depends on the amount of data being replicated.
Processor: 4 core
Memory: 16GB
Disk requirements: Depend on the data being extracted, but a minimum of 300GB
Network performance: High
The following software is required to be installed on the server:
BryteFlow is a very robust application that makes data replication to the cloud easy and smooth. It can deal with huge data volumes with ease, and the process is fully automated. The setup is done in 3 easy steps. It does not need highly technical resources; basic knowledge of the following is recommended to deploy the software:
Steps to launch BryteFlow from AWS Marketplace: Enterprise Edition
Supported AWS Regions:
BryteFlow Ingest is validated and supported in the AWS Regions below; however, it can be launched in all AWS Regions.
BryteFlow is available in ALL AWS Regions.
Please contact BryteFlow Support if you need any assistance.
Additional information regarding launching an EC2 instance can be found here
If you have any trouble launching or connecting to the EC2 instance, please refer to the troubleshooting guides below:
** Please note that BryteFlow Blend is a companion product to BryteFlow Ingest. To make the most of the enterprise capabilities, first set up BryteFlow Ingest completely. Thereafter, no configuration is required in BryteFlow Blend; it is all ready to go. Start with the transformations directly off AWS S3.
Once connected to the EC2 instance:
Enter localhost:8081 into the Chrome browser to open the BryteFlow Ingest web console.
Enter localhost:8082 into the Chrome browser to open the BryteFlow Blend web console.
Steps to launch BryteFlow Ingest from AWS Marketplace: Standard Edition
BryteFlow is available in ALL AWS Regions.
Additional information regarding launching an EC2 instance can be found here
If you have any trouble launching or connecting to the EC2 instance, please refer to the troubleshooting guides below:
Once connected to the EC2 instance:
Enter localhost:8081 into the Chrome browser to open the BryteFlow Ingest web console.
Steps to launch BryteFlow from AWS Marketplace: SAP Data Lake Builder
Additional information regarding launching an EC2 instance can be found here
If you have any trouble launching or connecting to the EC2 instance, please refer to the troubleshooting guides below:
Once connected to the EC2 instance:
Enter localhost:8081 into the Chrome browser to open the BryteFlow Ingest web console.
AWS IAM roles are used to delegate access to the AWS resources. With IAM roles, you can establish trust relationships between your trusting account and other AWS trusted accounts. The trusting account owns the resource to be accessed and the trusted account contains the users who need access to the resource.
BryteFlow’s Recommendations:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "1",
"Action": [
"s3:DeleteObject",
"s3:GetObject",
"s3:ListBucket",
"s3:PutObject"
],
"Effect": "Allow",
"Resource": "arn:aws:s3:::
},
{
"Sid": "2",
"Action": [
"ec2:AcceptVpcEndpointConnections",
"ec2:AcceptVpcPeeringConnection",
"ec2:AssociateIamInstanceProfile",
"ec2:CreateTags",
"ec2:DescribeTags",
"ec2:RebootInstances"
],
"Effect": "Allow",
"Resource": "arn:aws:ec2:
},
{
"Sid": "3",
"Action": [
"elasticmapreduce:AddJobFlowSteps",
"elasticmapreduce:DescribeStep",
"elasticmapreduce:ListSteps",
"elasticmapreduce:RunJobFlow",
"elasticmapreduce:ListCluster",
"elasticmapreduce:DescribeCluster"
],
"Effect": "Allow",
"Resource": "arn:aws:elasticmapreduce:
},
{
"Sid": "4",
"Action": [
"sns:Publish"
],
"Effect": "Allow",
"Resource": "arn:aws:sns:
},
{
"Sid": "5",
"Action": [
"redshift:ExecuteQuery",
"redshift:FetchResults",
"redshift:ListTables"
],
"Effect": "Allow",
"Resource": "arn:aws:redshift:
},
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"dynamodb:CreateTable",
"dynamodb:PutItem",
"dynamodb:Update*",
"dynamodb:Get*",
"dynamodb:Scan"
],
"Resource": "arn:aws:dynamodb:
},
{
"Sid": "6",
"Effect": "Allow",
"Action": [
"secretsmanager:GetSecretValue",
"secretsmanager:DescribeSecret",
"secretsmanager:PutSecretValue",
"secretsmanager:UpdateSecret"
],
"Resource": "arn:aws:secretsmanager:*:
},
{
"Sid": "7",
"Effect": "Allow",
"Action": "secretsmanager:ListSecrets",
"Resource": "*"
}
]
}
Below are the various roles and permissions needed for launching and managing the BryteFlow application.
Role | Type | Permissions/Policies | Purpose |
EC2Admin | AWS Custom Role for EC2 | List-DescribeInstanceStatus Directory Service List,Write-DescribeDirectories,CreateComputer Systems Manager List,Read,Write ListAssociations, ListInstanceAssociations, DescribeAssociation, DescribeDocument, GetDeployablePatchSnapshotForInstance, GetDocument, GetManifest, GetParameters, PutComplianceItems, PutInventory, UpdateAssociationStatus, UpdateInstanceAssociationStatus, UpdateInstanceInformation |
Create and Manage EC2 instance |
DBAdmin | AWS Custom Role | cloudwatch:DeleteAlarms, cloudwatch:Describe*, cloudwatch:DisableAlarmActions, cloudwatch:EnableAlarmActions, cloudwatch:Get*, cloudwatch:List*, cloudwatch:PutMetricAlarm, datapipeline:ActivatePipeline, datapipeline:CreatePipeline, datapipeline:DeletePipeline, datapipeline:DescribeObjects, datapipeline:DescribePipelines, datapipeline:GetPipelineDefinition, datapipeline:ListPipelines, datapipeline:PutPipelineDefinition, datapipeline:QueryObjects, dynamodb:CreateTable, dynamodb:BatchGetItem, dynamodb:BatchWriteItem, dynamodb:ConditionCheckItem, dynamodb:PutItem, dynamodb:DescribeTable, dynamodb:DeleteItem, dynamodb:GetItem, dynamodb:Scan, dynamodb:Query, dynamodb:UpdateItem, ec2:DescribeAccountAttributes, ec2:DescribeAddresses, ec2:DescribeAvailabilityZones, ec2:DescribeInternetGateways, ec2:DescribeSecurityGroups, ec2:DescribeSubnets, ec2:DescribeVpcs, iam:ListRoles, iam:GetRole, kms:ListKeys, lambda:CreateEventSourceMapping, lambda:CreateFunction, lambda:DeleteEventSourceMapping, lambda:DeleteFunction, lambda:GetFunctionConfiguration, lambda:ListEventSourceMappings, lambda:ListFunctions, logs:DescribeLogGroups, logs:DescribeLogStreams, logs:FilterLogEvents, logs:GetLogEvents, logs:Create*, logs:PutLogEvents, logs:PutMetricFilter, rds:*, redshift:CreateCluster, redshift:DeleteCluster, redshift:ModifyCluster, redshift:RebootCluster, s3:CreateBucket, sns:CreateTopic, sns:DeleteTopic, sns:Get*, sns:List*, sns:SetTopicAttributes, sns:Subscribe, sns:Unsubscribe |
Manage DB access and privileges |
NetworkAdmin | Custom Role | autoscaling:Describe*, cloudfront:ListDistributions, cloudwatch:DeleteAlarms, cloudwatch:DescribeAlarms, cloudwatch:GetMetricStatistics, cloudwatch:PutMetricAlarm, directconnect:*, ec2:AcceptVpcEndpointConnections, ec2:AllocateAddress, ec2:AssignIpv6Addresses, ec2:AssignPrivateIpAddresses, ec2:AssociateAddress, ec2:AssociateDhcpOptions, ec2:AssociateRouteTable, ec2:AssociateSubnetCidrBlock, ec2:AssociateVpcCidrBlock, ec2:AttachInternetGateway, ec2:AttachNetworkInterface, ec2:AttachVpnGateway, ec2:CreateCarrierGateway, ec2:CreateCustomerGateway, ec2:CreateDefaultSubnet, ec2:CreateDefaultVpc, ec2:CreateDhcpOptions, ec2:CreateEgressOnlyInternetGateway, ec2:CreateFlowLogs, ec2:CreateInternetGateway, ec2:CreateNatGateway, ec2:CreateNetworkAcl, ec2:CreateNetworkAclEntry, ec2:CreateNetworkInterface, ec2:CreateNetworkInterfacePermission, ec2:CreatePlacementGroup, ec2:CreateRoute, ec2:CreateRouteTable, ec2:CreateSecurityGroup, ec2:CreateSubnet, ec2:CreateTags, ec2:CreateVpc, ec2:CreateVpcEndpoint, ec2:CreateVpcEndpointConnectionNotification, ec2:CreateVpcEndpointServiceConfiguration, ec2:CreateVpnConnection, ec2:CreateVpnConnectionRoute, ec2:CreateVpnGateway, ec2:DeleteCarrierGateway, ec2:DeleteEgressOnlyInternetGateway, ec2:DeleteFlowLogs, ec2:DeleteNatGateway, ec2:DeleteNetworkInterface, ec2:DeleteNetworkInterfacePermission, ec2:DeletePlacementGroup, ec2:DeleteSubnet, ec2:DeleteTags, ec2:DeleteVpc, ec2:DeleteVpcEndpointConnectionNotifications, ec2:DeleteVpcEndpointServiceConfigurations, ec2:DeleteVpcEndpoints, ec2:DeleteVpnConnection, ec2:DeleteVpnConnectionRoute, ec2:DeleteVpnGateway, ec2:DescribeAccountAttributes, ec2:DescribeAddresses, ec2:DescribeAvailabilityZones, ec2:DescribeCarrierGateways, ec2:DescribeClassicLinkInstances, ec2:DescribeCustomerGateways, ec2:DescribeDhcpOptions, ec2:DescribeEgressOnlyInternetGateways, ec2:DescribeFlowLogs, ec2:DescribeInstances, ec2:DescribeInternetGateways, ec2:DescribeKeyPairs, ec2:DescribeMovingAddresses, ec2:DescribeNatGateways, ec2:DescribeNetworkAcls, ec2:DescribeNetworkInterfaceAttribute, ec2:DescribeNetworkInterfacePermissions, ec2:DescribeNetworkInterfaces, ec2:DescribePlacementGroups, ec2:DescribePrefixLists, ec2:DescribeRouteTables, ec2:DescribeSecurityGroupReferences, ec2:DescribeSecurityGroupRules, ec2:DescribeSecurityGroups, ec2:DescribeStaleSecurityGroups, ec2:DescribeSubnets, ec2:DescribeTags, ec2:DescribeVpcAttribute, ec2:DescribeVpcClassicLink, ec2:DescribeVpcClassicLinkDnsSupport, ec2:DescribeVpcEndpointConnectionNotifications, ec2:DescribeVpcEndpointConnections, ec2:DescribeVpcEndpointServiceConfigurations, ec2:DescribeVpcEndpointServicePermissions, ec2:DescribeVpcEndpointServices, ec2:DescribeVpcEndpoints, ec2:DescribeVpcPeeringConnections, ec2:DescribeVpcs, ec2:DescribeVpnConnections, ec2:DescribeVpnGateways, ec2:DescribePublicIpv4Pools, ec2:DescribeIpv6Pools, ec2:DetachInternetGateway, ec2:DetachNetworkInterface, ec2:DetachVpnGateway, ec2:DisableVgwRoutePropagation, ec2:DisableVpcClassicLinkDnsSupport, ec2:DisassociateAddress, ec2:DisassociateRouteTable, ec2:DisassociateSubnetCidrBlock, ec2:DisassociateVpcCidrBlock, ec2:EnableVgwRoutePropagation, ec2:EnableVpcClassicLinkDnsSupport, ec2:ModifyNetworkInterfaceAttribute, ec2:ModifySecurityGroupRules, ec2:ModifySubnetAttribute, ec2:ModifyVpcAttribute, ec2:ModifyVpcEndpoint, ec2:ModifyVpcEndpointConnectionNotification, ec2:ModifyVpcEndpointServiceConfiguration, ec2:ModifyVpcEndpointServicePermissions, ec2:ModifyVpcPeeringConnectionOptions, ec2:ModifyVpcTenancy, 
ec2:MoveAddressToVpc, ec2:RejectVpcEndpointConnections, ec2:ReleaseAddress, ec2:ReplaceNetworkAclAssociation, ec2:ReplaceNetworkAclEntry, ec2:ReplaceRoute, ec2:ReplaceRouteTableAssociation, ec2:ResetNetworkInterfaceAttribute, ec2:RestoreAddressToClassic, ec2:UnassignIpv6Addresses, ec2:UnassignPrivateIpAddresses, ec2:UpdateSecurityGroupRuleDescriptionsEgress, ec2:UpdateSecurityGroupRuleDescriptionsIngress, elasticbeanstalk:Describe*, elasticbeanstalk:List*, elasticbeanstalk:RequestEnvironmentInfo, elasticbeanstalk:RetrieveEnvironmentInfo, elasticloadbalancing:*, logs:DescribeLogGroups, logs:DescribeLogStreams, logs:GetLogEvents, route53:*, route53domains:*, sns:CreateTopic, sns:ListSubscriptionsByTopic, sns:ListTopics, ec2:AcceptVpcPeeringConnection, ec2:AttachClassicLinkVpc, ec2:AuthorizeSecurityGroupEgress, ec2:AuthorizeSecurityGroupIngress, ec2:CreateVpcPeeringConnection, ec2:DeleteCustomerGateway, ec2:DeleteDhcpOptions, ec2:DeleteInternetGateway, ec2:DeleteNetworkAcl, ec2:DeleteNetworkAclEntry, ec2:DeleteRoute, ec2:DeleteRouteTable, ec2:DeleteSecurityGroup, ec2:DeleteVolume, ec2:DeleteVpcPeeringConnection, ec2:DetachClassicLinkVpc, ec2:DisableVpcClassicLink, ec2:EnableVpcClassicLink, ec2:GetConsoleScreenshot, ec2:RejectVpcPeeringConnection, ec2:RevokeSecurityGroupEgress, ec2:RevokeSecurityGroupIngress, ec2:CreateLocalGatewayRoute, ec2:CreateLocalGatewayRouteTableVpcAssociation, ec2:DeleteLocalGatewayRoute, ec2:DeleteLocalGatewayRouteTableVpcAssociation, ec2:DescribeLocalGatewayRouteTableVirtualInterfaceGroupAssociations, ec2:DescribeLocalGatewayRouteTableVpcAssociations, ec2:DescribeLocalGatewayRouteTables, ec2:DescribeLocalGatewayVirtualInterfaceGroups, ec2:DescribeLocalGatewayVirtualInterfaces, ec2:DescribeLocalGateways, ec2:SearchLocalGatewayRoutes, s3:GetBucketLocation, s3:GetBucketWebsite, s3:ListBucket, iam:GetRole, iam:ListRoles, iam:PassRole, ec2:AcceptTransitGatewayVpcAttachment, ec2:AssociateTransitGatewayRouteTable, ec2:CreateTransitGateway, ec2:CreateTransitGatewayRoute, ec2:CreateTransitGatewayRouteTable, ec2:CreateTransitGatewayVpcAttachment, ec2:DeleteTransitGateway, ec2:DeleteTransitGatewayRoute, ec2:DeleteTransitGatewayRouteTable, ec2:DeleteTransitGatewayVpcAttachment, ec2:DescribeTransitGatewayAttachments, ec2:DescribeTransitGatewayRouteTables, ec2:DescribeTransitGatewayVpcAttachments, ec2:DescribeTransitGateways, ec2:DisableTransitGatewayRouteTablePropagation, ec2:DisassociateTransitGatewayRouteTable, ec2:EnableTransitGatewayRouteTablePropagation, ec2:ExportTransitGatewayRoutes, ec2:GetTransitGatewayAttachmentPropagations, ec2:GetTransitGatewayRouteTableAssociations, ec2:GetTransitGatewayRouteTablePropagations, ec2:ModifyTransitGateway, ec2:ModifyTransitGatewayVpcAttachment, ec2:RejectTransitGatewayVpcAttachment, ec2:ReplaceTransitGatewayRoute, ec2:SearchTransitGatewayRoutes |
Manage Network access and firewall settings |
BryteFlowAdmin | Custom Role | elasticmapreduce:ListClusters, glue:GetDatabase, athena:StartQueryExecution, athena:ListDatabases, glue:GetPartitions, glue:UpdateTable, athena:GetQueryResults, athena:GetDatabase, glue:GetTable, athena:StartQueryExecution, glue:CreateTable, glue:GetPartitions, elasticmapreduce:ListSteps, athena:GetQueryResults, s3:ListBucket, elasticmapreduce:DescribeCluster, glue:GetTable, glue:GetDatabase, s3:PutObject, s3:GetObject, elasticmapreduce:DescribeStep, athena:StopQueryExecution, athena:GetQueryExecution, s3:DeleteObject, elasticmapreduce:AddJobFlowSteps, s3:GetBucketLocation, s3:PutObjectAcl, secretsmanager:GetSecretValue, secretsmanager:DescribeSecret, secretsmanager:PutSecretValue, secretsmanager:UpdateSecret |
Able to manage BryteFlow configurations |
Amazon S3 | Resource Based Policy | s3:PutObject, s3:GetObject, s3:DeleteObject, s3:GetBucketLocation, s3:PutObjectAcl Resource: arn:aws:s3:::<bucket-name>, arn:aws:s3:::<bucket-name>/* |
To manage bucket level permissions, resource-based policy for S3 should be applied to restrict the bucket level access. The policy is attached to the bucket, but the policy controls access to both the bucket and the objects in it. |
Amazon EC2 | Resource Based Policy | ec2:AcceptVpcEndpointConnections, ec2:AcceptVpcPeeringConnection, ec2:AssociateIamInstanceProfile, ec2:CreateTags, ec2:DescribeTags, ec2:RebootInstances Resource: arn:aws:ec2:<ec2_instance_id> |
To manage instance level permissions, resource-based policy for EC2 should be applied to restrict the access for the EC2 instance. |
AWS Marketplace | AWS managed policy | aws-marketplace:ViewSubscriptions, aws-marketplace:Subscribe, aws-marketplace:Unsubscribe, aws-marketplace:CreatePrivateMarketplaceRequests, aws-marketplace:ListPrivateMarketplaceRequests, aws-marketplace:DescribePrivateMarketplaceRequests |
A user launching BryteFlow from AWS Marketplace should have the 'AWSMarketplaceManageSubscriptions' policy attached. |
Below is the guide to prepare an environment for BryteFlow in AWS:
Create an IAM User: It is recommended to create a separate user for managing all AWS services. DO NOT use the root user for any task. Refer to the AWS guide to create an IAM admin user.
Security Group Rules: You can add or remove rules for a security group which is authorizing or revoking inbound or outbound access. A rule applies either to inbound traffic (ingress) or outbound traffic (egress). You can grant access to a specific CIDR range, or to another security group in your VPC or in a peer VPC (requires a VPC peering connection).
Creating an IAM Role: BryteFlow uses the IAM role assigned to the EC2 instance where the application is hosted. The EC2 role needs to have all the required policies attached. To create an IAM role for BryteFlow refer to the AWS guide. Assign the required policies to the newly created IAM role.
For more information on secret keys refer to AWS documentation here.
For security reasons, when using access keys it is recommended to rotate all keys after a certain time, for example every 90 days. More details are provided in the section 'Managing Access Keys'.
5. Creating an Auto Scaling Group: When BryteFlow Ingest needs to be deployed in an HA environment, it is recommended to have your EC2 instance along with an Auto Scaling group.
Please follow the steps here to launch one via the AWS console. When launching an Auto Scaling group via the console, the recommended parameters to specify are as follows:
Please refer to the AWS documentation on how to create an EC2 instance.
The following table shows the rules we recommend for your EC2. They block all traffic except that which is explicitly required.
The EC2 security group should have the required inbound and outbound rules as per below:
Inbound | |||||
Rule # | Source IP | Protocol | Port | Allow/Deny | Comments |
1 | Custom IP which requires access to BryteFlow Application | TCP | 80 | ALLOW | Allows inbound HTTP traffic from only known/ custom IPv4 address. |
2 | Public IPv4 address range of your home network | TCP | 22 | ALLOW | Allows inbound SSH traffic from your home network (over the Internet gateway). |
3 | Public IPv4 address range of your home network | TCP | 3389 | ALLOW | Allows inbound RDP traffic from your home network (over the Internet gateway). |
4 | 0.0.0.0/0 | all | all | DENY | Denies all inbound IPv4 traffic not already handled by a preceding rule (not modifiable). |
Outbound | |||||
Rule # | Dest IP | Protocol | Port | Allow/Deny | Comments |
1 | Source DB Host IP address | TCP | Custom port( port specific to source database ports) | ALLOW | Allows connections to Source database. |
2 | Redshift Cluster Host IP address | TCP | 5439 ( port specific to destination database i.e. Redshift ) | ALLOW | Allows connection to destination database, if Redshift is a preferred destination database. (not required if AWS S3, is a preferred destination) |
3 | 0.0.0.0/0 | all | all | DENY | Denies all outbound IPv4 traffic not already handled by a preceding rule (not modifiable). |
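If you prefer scripting over the console, an inbound rule such as rule 1 above could also be added with the AWS CLI; the security group ID and CIDR below are placeholder values, so substitute your own and adjust the port and source range to your environment.
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 80 --cidr 203.0.113.0/24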
To open ports on Amazon Console
Please perform the steps to allow the inbound traffic to your Amazon instance, mentioned in the following link:
To open ports On Windows Server
Please perform the steps to allow the inbound traffic to your server, mentioned in the following link:
VPC Details:
Related Subnet details:
Below is the reference for Route table and CIDRs :
BryteFlow connects to any source and destination endpoints outside of its VPC using NAT/VPN or API Gateways.
A NAT gateway is a Network Address Translation (NAT) service. You can use a NAT gateway so that instances in a private subnet can connect to services outside your VPC but external services cannot initiate a connection with those instances. For more details refer to AWS guide.
To connect the VPC to remote network for enabling source/destination endpoint connections, use AWS VPN. For more details refer to AWS guide.
Please refer to the AWS documentation for creating an S3 bucket.
Prior to launching an EMR cluster it is recommended to verify the service limits for EMR within your AWS region.
When using BryteFlow in,
To know more about AWS service limits and how to manage service limits click on the respective links.
Login to your AWS account and select the correct AWS region where your S3 bucket and EC2 container are located.
As BryteFlow uses several AWS resources to fulfill user requirements, the cost of these services is separate from BryteFlow charges and is billed by AWS to your account. If you are using Snowflake as a destination, the cost of the Snowflake data warehouse is also separate from BryteFlow.
The list below shows the other billable services used with BryteFlow. Please use the AWS Pricing Calculator to estimate the AWS cost of additional resources.
A sample estimate for a high availability setup with a source data volume of 100 GB is provided for reference here. Please note that not all services are mandatory; the size and number of services will vary for each customer environment. The sample is for reference purposes only.
Please note: ALL AWS services have service limits; check for sufficient resources before launching the services and, if needed, request a quota increase following AWS guidelines. Please refer to the AWS guide to check the service limit corresponding to each service.
Service | Mandatory | Billing Type | Service Limits |
AWS EC2 | Y | Pay-as-you-go | check EC2 quota here |
Additional EBS storage attached to EC2 | Y | Based on size | |
AWS S3 | N | Pay-as-you-go | check Amazon S3 quota here |
AWS EMR | N (only required for S3 as a destination ) | Pay-as-you-go | check EMR quota here |
AWS Redshift | N | Pay-as-you-go | check Amazon Redshift quota here |
AWS CloudWatch Logs and metrics | N | Pay-as-you-go | check EC2 quota here |
AWS SNS | N | Pay-as-you-go | check AWS SNS quota here |
AWS DynamoDB (5 WCUs / 5 RCUs) | N | Pay-as-you-go | check DynamoDB quota here |
Snowflake DW | N | Pay-as-you-go | |
AWS Lambda | N | Pay-as-you-go | check AWS Lambda quota here |
AWS KMS | N | Pay-as-you-go | check Amazon KMS quota here |
AWS Athena | N | Pay-as-you-go | check Amazon Athena quota here |
AWS Kinesis | N | Pay-as-you-go | check Amazon Kinesis quota here |
BryteFlow recommends using the instance types below for EC2, with EBS volumes attached:
EC2 Instance Type | BryteFlow Standard Edition | BryteFlow Enterprise Edition | Recommended EBS volumes | EBS Volume Type |
t2.small | Volume < 100 GB | NA | 50 GB | General Purpose SSD (gp2) Volumes |
t2.medium | Volume >100 and < 300 GB | Volume < 100 GB | 100 GB | General Purpose SSD (gp2) Volumes |
t2.large | Volume > 300 GB and < 1 TB | Volume >100 and < 300 GB | 500 GB | General Purpose SSD (gp2) Volumes |
m4.large | NA | Volume > 300 GB and < 1 TB | 500 GB | General Purpose SSD (gp2) Volumes |
BryteFlow uses an access key and secret key to authenticate to AWS services such as S3 and Redshift. It requires an AWS access key ID and AWS secret key to access S3 and other services from on-premises. AWS IAM roles are used when running from an AMI or an EC2 server.
For security reasons, when using access keys or KMS keys it is recommended to rotate keys after a certain time, for example every 90 days. After the new keys are generated they need to be updated in Ingest's configuration. Please follow the steps below:
Details of key rotation can be found in AWS documentation https://docs.aws.amazon.com/kms/latest/developerguide/rotate-keys.html
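For IAM access keys specifically, the rotation itself can be scripted with the AWS CLI; the user name below is a placeholder. Create the new key, update it in Ingest's configuration, verify replication, and only then delete the old key.
aws iam create-access-key --user-name bryteflow-ingest-user
aws iam delete-access-key --user-name bryteflow-ingest-user --access-key-id AKIAIOSFODNN7EXAMPLE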
The IAM role for BryteFlow should have the recommended policies attached. Please refer to the section 'AWS Identity and Access Management (IAM) for BryteFlow' for the list of policies and permissions.
BryteFlow ensures data security through various mechanisms by applying encryption.
BryteFlow Ingest does not store any data outside of the customer's designated environment. It can store data in the AWS services below, depending on customer requirements:
Also, all Non-AWS destination endpoints:
As a best practice BryteFlow recommends enabling encryption on all the services where data is stored.
For security reasons, BryteFlow recommends rotating all keys configured in Ingest every 90 days. This includes the credentials for all source and destination endpoints. Below are some references for AWS services with more details.
For AWS KMS key rotation refer to the AWS guide.
For AWS Redshift key rotation refer to the AWS Guide.
Follow the recommendations as below for all Non-AWS sources and destinations:
External Applications | Reference for Key Rotation |
SAP | SAP password rotation |
Oracle | Oracle Password Rotation |
MS SQL Server | MS SQL Server password rotation |
Salesforce | Salesforce password rotation |
MySQL | MySQL password rotation |
PostgreSQL | PostgreSQL password rotation |
BryteFlow adheres to AWS recommendation of applying encryption of data at rest and in transit. It can be achieved by creating the keys and certificates that are used for encryption.
For more information, refer to AWS documentation on Providing Keys for Encrypting Data at Rest with Amazon EMR and Providing Certificates for Encrypting Data in Transit with Amazon EMR Encryption.
For an Amazon Redshift destination, it is recommended to enable database encryption to protect data at rest. Refer to the AWS guide for more details.
AWS Secrets Manager uses encryption via AWS KMS, for more details refer to AWS Guide.
Choose options under Encryption according to the following guidelines:
Under S3 data encryption, for Encryption mode, choose a value to determine how Amazon EMR encrypts Amazon S3 data with EMRFS. BryteFlow Ingest supports the encryption mechanisms below.
BryteFlow uses SSL to establish any connection(AWS services, databases etc.) for data flow, ensuring secure communication in-transit.
SSL involves the complexity of managing security certificates, and it is important to keep the certificates active at all times for uninterrupted service.
AWS Certificate Manager handles the complexity of creating and managing public SSL/TLS certificates. Customers can configure notifications before the expiry date approaches and renew certificates in advance, so that services run uninterrupted. Refer to the AWS guide to manage ACM here.
BryteFlow uses AWS Secrets Manager to store any and all credentials. This includes both source and destination endpoint credentials for databases and APIs. All the secrets are encrypted using KMS encryption. BryteFlow creates a secret in AWS Secrets Manager for all credentials along with the BryteFlow admin user details, and also allows the secret to be modified from the GUI. Go to the respective setup page in the BryteFlow application to update the secret details. It is recommended to rotate all keys stored in Secrets Manager; refer to the AWS guide for the same.
To test remote connections you will need the Telnet utility. Telnet has to be enabled from the Control Panel under 'Turn Windows features on or off'.
telnet <IP address or Hostname> Port number
For example
telnet 192.168.1.1 8081
If the connection is unsuccessful then an error will be shown.
If the command prompt window is blank only with the cursor, then the connection is successful and the service is available.
In case of any connectivity issue to source or destination database, please check if the BryteFlow server is able to reach the remote host:port.
You can test the connection to the IP address and port using the telnet command.
telnet <IP address or Hostname> Port number
Or you can use the PowerShell command to verify the connection.
tnc <IP address or Hostname> -Port <port number>
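For example, to check that the BryteFlow server can reach a source SQL Server on its default port (the IP address and port here are illustrative):
tnc 10.0.0.25 -Port 1433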
Error: Unable to start Windows service ‘BryteFlow Ingest’
Resolution: If Java is not installed or the system path is not updated, the Ingest service will throw an error on startup. Install Java 1.8 or add the Java path to the system path. To verify, go to CMD and type: java -version
If the response is 'unable to recognize command', please check the Java path in the 'Path' environment variable and update it to the correct path.
Issue: The BryteFlow Ingest service is installed and started, but the application does not launch in the browser.
Resolution: The BryteFlow application requires Java 1.8 to function. Please install the correct version of Java and restart the service.
If Java 11 is installed, the Ingest service will start up, but the page will display an error message.
To verify the version, go to CMD and type: java -version
Expected result : java 1.8 <any build>
For Example: java version “1.8.0_171”
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
If the Java version is higher, please uninstall Java and install the required version.
Error: ‘Cannot Open database ‘demo’ requested by the login’
Resolution: The user does not have the grants to connect to the database. Apply the correct grants to the user and try again.
Error: ‘Login failed for User ‘Demo’
Resolution: The user does not exist or there is a typo in the username or the password is incorrect.
Please follow the recommended steps below to set up your MS SQL Server source connector.
SQL Server setup depends on the replication option chosen, Change Tracking OR Change Data Capture. Prerequisites for each option are described in detail. Follow the link for details.
The BryteFlow Ingest database replication login user should have VIEW CHANGE TRACKING permission to view the Change Tracking information.
--Review all change tracking tables that are enabled (= 1) or disabled (= 0)
SELECT * FROM sys.all_objects WHERE object_id IN (SELECT object_id FROM sys.change_tracking_tables WHERE is_track_columns_updated_on = 1);
To verify whether change tracking is already enabled on the database, run the following SQL query. If a row is returned, Change Tracking has been enabled for the database.
SELECT * FROM sys.change_tracking_databases WHERE database_id = DB_ID('databasename');
The following SQL lists all the tables for which Change Tracking has been enabled in the selected database:
USE databasename;
SELECT sys.schemas.name as schema_name, sys.tables.name as table_name
FROM sys.change_tracking_tables
JOIN sys.tables ON sys.tables.object_id = sys.change_tracking_tables.object_id
JOIN sys.schemas ON sys.schemas.schema_id = sys.tables.schema_id;
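If Change Tracking has not yet been enabled, it can be turned on at the database level and then per table using standard SQL Server statements; the database, schema, table and login names below are placeholders, and the retention settings should be chosen to suit your environment.
--Enable Change Tracking at the database level (placeholder database name)
ALTER DATABASE demo SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);
--Enable Change Tracking on each table to be replicated (placeholder table name)
ALTER TABLE dbo.customer ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);
--Allow the BryteFlow replication login to read change tracking information (placeholder login name)
GRANT VIEW CHANGE TRACKING ON OBJECT::dbo.customer TO bflow_ingest_user;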
BryteFlow Ingest source supports most MS SQL Server data types, see the following table for the supported list:
BIGINT | REAL | VARCHAR (max) |
BIT | FLOAT | NCHAR |
DECIMAL | DATETIME | NVARCHAR (length) |
INT | DATETIME2 | NVARCHAR (max) |
MONEY | SMALLDATETIME | BINARY |
NUMERIC (p,s) | DATE | VARBINARY |
SMALLINT | TIME | VARBINARY (max) |
SMALLMONEY | DATETIMEOFFSET | TIMESTAMP |
TINYINT | CHAR | UNIQUEIDENTIFIER |
VARCHAR | HIERARCHYID | XML |
Please follow the recommended steps below to set up your Oracle source connector.
exec rdsadmin.rdsadmin_util.alter_supplemental_logging('ADD','ALL');
exec rdsadmin.rdsadmin_util.set_configuration('archivelog retention hours',24);
ALTER TABLE <schema>.<tablename> ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
Execute the following queries on Oracle Server to enable change tracking.
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA; ALTER DATABASE FORCE LOGGING;
ALTER TABLE <schema>.<tablename> ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
The Oracle user running BryteFlow Ingest must have the following security privileges:
SELECT access on all tables to be replicated
The following statement should return records…
SELECT * FROM V$ARCHIVED_LOG;
If no records are returned, SELECT access on V_$ARCHIVED_LOG should be provided, or check whether the database is in ARCHIVELOG mode.
The following security permissions should be assigned to the user
CREATE SESSION
SELECT access on V_$LOGMNR_CONTENTS
SELECT access on V_$LOGMNR_LOGS
SELECT access on ANY TRANSACTION
SELECT access on DBA_OBJECTS
EXECUTE access on DBMS_LOGMNR
Run the following grant statements for <user> for the above requirements
GRANT SELECT ON V_$ARCHIVED_LOG TO <user>;
GRANT SELECT ON V_$LOGMNR_CONTENTS TO <user>;
GRANT EXECUTE ON DBMS_LOGMNR TO <user>;
GRANT SELECT ON V_$LOGMNR_LOGS TO <user>;
GRANT SELECT ANY TRANSACTION TO <user>;
GRANT SELECT ON DBA_OBJECTS TO <user>;
To verify whether Oracle is set up correctly for change detection, execute the following queries.
Condition to be checked | SQL to be executed | Result expected |
---|---|---|
Is ArchiveLog mode enabled? | SELECT log_mode FROM V$DATABASE; | ARCHIVELOG |
Is Supplemental logging turned on at database level? | SELECT supplemental_log_data_min FROM V$DATABASE; | YES |
Is Supplemental Logging turned on at table level? | SELECT log_group_name, table_name, always, log_group_type FROM dba_log_groups; | <log group name>, <table name>, ALWAYS, ALL COLUMN LOGGING |
BryteFlow Ingest source supports most Oracle data types, see the following table for the supported list:
BINARY_DOUBLE | BINARY_FLOAT | CHAR |
DATE | INTERVAL DAY TO SECOND | LONG |
LONG RAW | NCHAR | NUMBER |
NVARCHAR | RAW | REF |
TIMESTAMP | TIMESTAMP WITH LOCAL TIME ZONE | VARCHAR2 |
To enable binary logging, the following parameters need to be configured as below in my.ini file on MySQL on Windows or in my.cnf file on MySQL on UNIX:
Parameter | Value |
---|---|
server_id | Any value from 1. E.g. server_id = 1 |
log_bin=<path> | Path to the binary log file. E.g. log_bin = D:\MySQLLogs\BinLog |
binlog_format | binlog_format=row |
expire_logs_days | To avoid disk space issues it is strongly recommended not to use the default value (0). E.g. expire_logs_days = 4 |
binlog_checksum | This parameter can be set to binlog_checksum=none. BryteFlow does support CRC32 as well |
binlog_row_image | binlog_row_image=full |
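Putting the parameters above together, a minimal my.ini / my.cnf sketch under the [mysqld] section could look as follows; the log path and retention value are only examples from the table above and should be adjusted to your server.
[mysqld]
server_id = 1
log_bin = D:\MySQLLogs\BinLog
binlog_format = row
expire_logs_days = 4
binlog_checksum = none
binlog_row_image = full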
To enable change tracking for MySQL on Amazon RDS, perform the following steps.
Set binlog_format=row and binlog_checksum=none (CRC32 is also supported).
CREATE USER 'bflow_ingest_user' IDENTIFIED BY '*****';
GRANT SELECT, REPLICATION CLIENT, SHOW DATABASES ON *.* TO bflow_ingest_user;
GRANT SELECT, REPLICATION SLAVE, SHOW DATABASES ON *.* TO bflow_ingest_user;
Note: If the source DB type is Amazon RDS MySQL, please download 'mysqlbinlog.exe' and add its directory path to the Windows 'PATH' environment variable on the client machine (the BryteFlow server).
Set binlog_format=row and binlog_checksum=none (CRC32 is also supported).
The max_replication_slots value should be set according to the number of tasks that you want to run. For example, to run four tasks you need to set a minimum of four slots. Slots open automatically as soon as a task starts and remain open even when the task is no longer running. You need to manually delete open slots.
Set max_wal_senders to a value greater than 1.
The max_wal_senders parameter sets the number of concurrent tasks that can run.
Set wal_sender_timeout =0
The wal_sender_timeout parameter terminates replication connections that are inactive longer than the specified number of milliseconds. Although the default is 60 seconds, we recommend that you set this parameter to zero, which disables the timeout mechanism.
Note: After changing these parameters, PostgreSQL needs to be restarted.
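Putting these parameters together, a minimal postgresql.conf sketch could look as follows; the values are illustrative and should be sized to the number of tasks you plan to run.
# size max_replication_slots to at least the number of tasks; max_wal_senders must be greater than 1
max_replication_slots = 4
max_wal_senders = 5
wal_sender_timeout = 0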
On Salesforce, Change Data Capture is turned on by default; please do not turn it off.
You will need to generate a security token to be used with BryteFlow Ingest.
A security token is a case-sensitive alphanumeric key that is appended to your Salesforce password.
e.g. your Salesforce password to be used with Ingest will be "<your Salesforce password><security_token>"
A token can be generated by following these steps:
1. Log in to your Salesforce account and go to My Settings > Personal > Reset My Security Token.
2. Click the Reset Security Token button. The token will be emailed to the email account associated with your Salesforce account.
For databases such as DB2 or Firebird, or for any RDBMS where there is no access to archive logs for change data, BryteFlow offers a trigger-based option to capture the change data.
For this solution there are certain prerequisites which need to be implemented:
Please provide the relevant grants to the BryteFlow replication user in order to proceed with the trigger solution.
Pre-Requisites for SAP HANA (Change tracking) :
1. Create a user account for BryteFlow.
CREATE USER <USERNAME> PASSWORD <PASSWORD>;
2. The BryteFlow replication user should have 'select' access on the tables to be replicated.
The BryteFlow replication user should have access to create triggers on the tables to be replicated.
Grant the privileges below to the BryteFlow user created above.
GRANT SELECT, TRIGGER ON SCHEMA <YOURSCHEMA> TO <USERNAME>;
3. The BryteFlow replication user should have access to a schema where it can create a table on the source database.
This is used to store transactions for restart and recoverability.
Grant the privileges below to the BryteFlow user created above.
GRANT CREATE ANY ON SCHEMA <YOURSCHEMA> TO <USERNAME>;
If you are using the AMI from AWS Marketplace, BryteFlow Ingest will be preinstalled as a service in Windows.
Alternatively, you can install the service by executing the following command using the Command Prompt (Admin).
service.exe --WinRun4J:RegisterService
To Start BryteFlow Ingest
localhost:8081
To Stop BryteFlow Ingest
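The service can also be started and stopped from an elevated Command Prompt; this assumes the service is registered under the name 'BryteFlow Ingest', as referenced in the troubleshooting section of this guide.
net start "BryteFlow Ingest"
net stop "BryteFlow Ingest"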
The configuration of BryteFlow Ingest is performed through the web console at localhost:8081
The screen will then present the following tabs (left side of the screen)
Configure Source Databases using below API :
POST http://host:port/ingest/api/ingest?cmd=conn&conn=s
Body:
func=save&src-db=<database name>&src-host=<database host>&src-options=&src-port=<database port>&src-pwd=<database password>&src-pwd2=<database password>&src-type=rds.oracle11lm&src-uid=<database user id>&type=src
Configure Destination Databases using below API :
POST http://host:port/ingest/api/ingest?cmd=conn&conn=d
Body:
dst-bucket=<S3 bucket>&dst-db=<database name>&dst-dir=<S3 work directory>&dst-host=<Redshift host>&dst-iam_role=<IAM Role>&dst-options=&dst-port=<Redshift port>&dst-pwd=<Redshift password>&dst-pwd2=<Redshift password>&dst-type=rds.redmulti&dst-uid=<Redshift user id>&func=save&type=dst
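As an illustration, the source connection call shown earlier can be issued with a tool such as curl; the host, port and field values below are placeholders, while the endpoint and body parameters follow the format above.
curl -X POST "http://localhost:8081/ingest/api/ingest?cmd=conn&conn=s" --data "func=save&src-db=demo&src-host=10.0.0.25&src-options=&src-port=1521&src-pwd=secret&src-pwd2=secret&src-type=rds.oracle11lm&src-uid=bflow_ingest_user&type=src"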
The dashboard provides a central screen where the overall status of this instance of BryteFlow Ingest can be monitored
The connections tab provides access to the following sub-tabs
Configuration of MS SQL Server, Oracle, SAP (MS SQL Server), SAP (Oracle), MySQL, Salesforce, PostgreSQL, S/4 HANA, SAP ECC and others as a source database.
Please note: When using SID to connect to a dedicated Oracle server instance use ‘:SID’ in the Database Name of source configuration.
Oracle Pluggable DB:
Please note: When using SID to connect to a dedicated Oracle server instance use ‘:SID’ in the Database Name of source configuration.
This connector allows all RDBMS databases as a source connector to do a full extract and load. It uses JDBC drivers to connect, extracts the initial data and takes it over to any BryteFlow supported destination.
Please note incremental CDC via logs is not supported for this driver.
BryteFlow Ingest supports Apache Kafka as a destination. It integrates changes into Kafka location.
The incremental data is loaded as a message to Kafka Topics.
Kafka Message Keys and Partitioning is supported in BryteFlow. The Kafka messages contain a ‘key’ in each JSON message and messages can be put into partitions for parallel consumption.
BryteFlow Ingest puts messages into Kafka Topic in JSON format by default.
The minimum size of a Kafka message sent is 4096 bytes.
Prerequisites for Kafka as a target:
BryteFlow allows connection to SSL enabled Kafka.
Below are the required fields for Kafka SSL configuration:
BryteFlow allows connection to Kafka using Kerberos authentication.
Below are the required fields for Kafka Kerberos authentication:
Databricks is a unified cloud-based platform that lends itself to a variety of data goals, including data science, machine learning, and analytics, as well as data engineering, reporting, and business intelligence (BI). Because a single system can handle both affordable data storage, as expected of a data lake, and analytical capabilities, as expected of a data warehouse, the Databricks Lakehouse is a much-in-demand platform that makes data access simpler, faster, and more affordable.
BryteFlow supports Databricks on AWS as a destination connector.
To configure Databricks in BryteFlow, please follow the easy steps below:
Steps to generate JDBC URL for Databricks:
Steps to generate access tokens:
Azure SQL DB
Azure Synapse SQL
Azure ADLS2
BryteFlow supports Databricks on Azure as a destination connector.
To configure Databricks in BryteFlow, please follow the description below:
The DBFS mount point is created by creating a notebook on Azure Databricks with the following sample code and running it:
Please update the AZ URL according to the setup.
dbutils.fs.mount(source = 'wasbs://demo@demoadls2.blob.core.windows.net', mount_point = '/mnt/blobstorage', extra_configs = {'fs.azure.sas.demo.demoadls2.blob.core.windows.net': '?sv=2022-11-02&ss=bfqt&srt=co&sp=rwdlacupyx&se=2024-07-25T20:02:42Z&st=2023-07-25T12:02:42Z&spr=https&sig=IPDudzCistFlSkKSb1t2KGneuCmEV7IQTQJwxZroRBo%3D'})
dbutils.fs.ls('/mnt/blobstorage')
2. JDBC URL – Can be obtained from Azure Databricks as per the steps mentioned below.
3. Password – Please enter the Databricks personal access token as the password.
4. Database Name – Please enter the Databricks DB name, usually HIVE_METASTORE.
5. Container Name – Please enter the ADLS container name, used as a spool directory to load data files.
6. Account Name – Please enter the ADLS account name.
7. Account Key – Please enter the ADLS account key.
8. Data Directory – Please enter the ADLS data directory path.
Please Note: This section should not be referred to for BryteFlow Ingest v3.10 and onward. The user interface has been re-designed to give a more logical representation, and this section is retained only as a reference for previous versions.
To Configure S3 as the file system perform the following steps.
Please Note: This section is available in BryteFlow Ingest v3.10 and onward.
BryteFlow Ingest can access AWS services using IAM roles, or via access keys when used on-premises. The access method and credentials need to be configured in Ingest.
There are two options:
When accessing AWS services from a BryteFlow server that is on-premises, please select 'AWS Credentials' in the File system type and provide the information as below:
Configure IAM roles in BryteFlow Ingest to access AWS services.
To configure email updates to be sent, perform the following steps
NOTE: Please review this section in conjunction with Appendix: Understanding Extraction Process
To select the tables to transfer to the destination database on Amazon Redshift and/or an Amazon S3 bucket, perform the following steps.
This process of selecting tables, configuring primary keys and mask columns should be repeated for each of the tables. Once complete the next step is to…
This feature is mostly used in SAP environments. It allows a type change of columns/fields from the native character or numeric format to Integer, Long, Float, Date and Timestamp.
BryteFlow Ingest automatically converts data types during data replication or CDC to the destination formats.
The destination data types are:
INTEGER | @I |
LONG | @L |
FLOAT | @F |
DATE (including format clause e.g. yyyyMMdd) | @D(format) |
TIMESTAMP (including format clause e.g. yyyy-MM-dd HH:mm:ss) | @T(format) |
Please note: The (format) part can be almost anything based on the value in the source column.
Amazon S3 And Amazon Redshift
Partitioning can dramatically improve efficiency and performance. It can be set up when replicating to S3 (data is partitioned into folders) and/or Redshift (data is partitioned into tables). The partitioning string is entered into the Partitioning folder field. The format for partitioning is as follows:
/@<column index>(<partition prefix>=<partition_format>)
Column Index
To build a partitioning folder structure the column index (starting from 1) of the column(s) to be used in the partition need to be known, in this simple table there are 3 columns…
Partition Prefix (optional)
Each partition can be prefixed with a named fixed string. The last character of the Partition Prefix can be set to '='; ending with '=' is useful when creating partitions on S3, as this facilitates the automated build/recovery of partitions (see below).
An example of partitioning on the first letter of column 2 (fullname in this case) is as follows:
/@2(fullname_start=%1s)
Refer to the MSCK REPAIR TABLE command in AWS Athena documentation. A lower case partition prefix is recommended as an upper/mixed case partition prefix can result in issues when using Athena.
--Builds/recovers partitions and data associated with partitions MSCK REPAIR TABLE <athena_table_name>;
Once the MSCK REPAIR TABLE <athena_table_name>; command has been executed, all data will be added to the relevant partitions, and any new data will be automatically added to the existing partitions. However, if new partitions are created by BryteFlow Ingest, the MSCK REPAIR TABLE <athena_table_name>; command will have to be re-executed to make the data available for query purposes in the Athena table.
Format
The format is applied to the column index specified above, for example to partition the data by year (on a date column) you’d use the format %y, to partition by the 24 hour format of time you’d use the format %H.
Example 1: Year
Assuming Column Index 7 was a date field…
/@7(year=%y)
This would create partition folders such as
Example 2: YearMonthDay
Assuming Column Index 7 was a date field…
/@7(%y%M%d)
This would create partition folders such as
Example 3: yyyymmdd=YearMonthDay
Assuming Column Index 7 was a date field…
/@7(yyyymmdd=%y%M%d)
This would create partition folders such as (useful format to automate recovery/initial population of data associated with partitions when using Athena)
Example 4: DOB column was used to create sub partitions of yr, mth and day
Assuming DOB Column Index 4 was a date
/@4(yr=%y)/@4(mth=%M)/@4(day=%d)
Example 5: model_nm=model_values and then sub partitions of yearmonth=YearMonth (multiple column partitioning)
Assuming Column Index 6 was a string (containing for example model_name_a, example model_name_b and example model_name_c) and Column Index 13 was a date field…
/@6(model_nm=%s)/@13(yearmonth=%y%M)
Format | Datatype | Description |
%y | TIMESTAMP | Four digit year e.g. 2018 |
%M | TIMESTAMP | Two digit month with zero prefix e.g. March -> 03 |
%d | TIMESTAMP | Two digit date with zero prefix e.g. 01 |
%H | TIMESTAMP | Two digit 24 hour with zero prefix e.g. 00 |
%q | TIMESTAMP | Two digit month indicating the start month of the quarter e.g. March -> 01 |
%Q | TIMESTAMP | Two digit month indicating the end month of the quarter e.g. March -> 03 |
%r | TIMESTAMP | Two digit month indicating the start of the half year e.g. March -> 01 |
%R | TIMESTAMP | Two digit month indicating the end of the half year e.g. March -> 06 |
%i | INTEGER | Value of the integer e.g. 12345 |
%<n>i | INTEGER | Value of the integer prefixed by zeros to specified width e.g. %8i for 12345 is 00012345 |
%<m>.<n>i | INTEGER | Value of the integer is truncated to the number of zeros specified by <n> and prefixed by zeros to the width specified by <m> e.g. %8.2i for 12345 is 00012300 |
%.<n>i | INTEGER | Value of the integer is truncated to the number of zeros specified by <n> e.g. %.2i for 12345 is 12300 |
%s | VARCHAR | Value of the string e.g. ABCD |
%<n>s | VARCHAR | Value of the string truncated to the specified width e.g. %2s for ABCD is AB |
API Controls for the Schedule:
http://host:port/bryteflow/wv?cmd=rstat&func=off
http://host:port/bryteflow/wv?cmd=rstat&func=on
API Controls to get the Statistics of the Ingest instance:
Executing the below URL will return the statistics of Ingest:
http://host:port/bryteflow/ingest/init?cmd=dcon&func=getstat
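For example, the statistics endpoint can be called from the command line; replace the host and port with the address of your Ingest instance (localhost:8081 in a default install).
curl "http://localhost:8081/bryteflow/ingest/init?cmd=dcon&func=getstat"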
You can add additional table(s) if replication is up and running and the need arises to add a new table to the extraction process…
This will initiate the new table(s) for a full extract; once completed, BryteFlow Ingest will automatically resume processing deltas for the new and all previously configured tables.
If the Table transfer type is Primary Key with History, to resync all the data from source, perform the following steps
In the event of unexpected issues (such as intermittent source database outages or network connectivity problems) it is possible to wind back the status of BryteFlow Ingest in time and replay all of the changes. Suppose a problem occurred at, say, 16:04; you can roll back BryteFlow Ingest to a point in time before these issues started occurring, say 15:00. To perform this operation….
The configuration tab provides access to the following sub-tabs
Web Port: The port on which the BryteFlow Ingest server will run.
Max Catchup Log: The number of Oracle archive logs that will be processed at a time.
Run Every: Set timer for the minimum minutes between catchup batches.
Convert RAW to Hex: To handle raw columns by converting to hex string instead of ignoring as CHAR(1).
Some additional configurations for Destination Databases,
File compression: Output Compression method, available options are as follows
Loading threads: Number of Redshift loading threads.
Schema for all tables: Ignore source schema and put all tables in this schema on destination
Schema for staging tables: Schema name to be used for staging tables in destination.
Retaining staging tables: Check to Retain staging tables in destination.
Source Start Date: Column name for source date for type-2 SCD record.
History End Date: Column name for history end date for type-2 SCD record.
End Date Value: End date used for history.
Ignore database name in schema: Check to ignore DB name as part of schema prefix for destination tables.
No. of data slices: Number of slices to split the data file into.
Max Updates: Combine updates that exceed this value.
Use SSE: Store in S3 using SSE (server-side encryption).
S3 Proxy Host: S3 proxy host name.
S3 Proxy Host Port: S3 proxy port.
S3 Proxy user ID: S3 proxy user id.
S3 Proxy Password: S3 proxy password.
To get a valid license go to Configuration tab, then to the License tab and email the “Product ID” to the Bryte support team – support@bryteflow.com
NOTE: Licensing is not applicable when sourced from the AWS Marketplace.
BryteFlow Ingest provides high availability support; it automatically saves the current configuration and execution state to S3 and DynamoDB. As a result, an instance of BryteFlow Ingest (including its current state) can be recovered should it be catastrophically lost. Before use this must be configured: select the Configuration tab and then the Recovery sub-tab to enter the required configuration.
BryteFlow keeps a backup of every successful job execution on S3 and DynamoDB and makes the latest version available for the user to recover from. Follow the steps below to set up recovery and to recover in case of failure.
Prerequisites for enabling recovery:
Follow the steps below to configure the S3 backup location for BryteFlow Ingest:
The recovery data is stored in DynamoDB (the AWS fully managed NoSQL database service). The recovery data for the named instance (in this example, Your_Ingest_Name) is stored in a DynamoDB table called BryteFlow.Ingest, as shown below:
To recover an instance of BryteFlow Ingest, you should source a new Instance of BryteFlow Ingest from the AWS Marketplace
Enter localhost:8081 into the Chrome browser to open the BryteFlow Ingest web console.
BryteFlow Ingest will collect the configuration and saved execution state of the instance selected (in this case 'Your_Ingest_Name') and restore accordingly.
Once restored, it is recommended to stop the EC2 instance at fault (the previous install):
NOTE: Recovery can also be a method of partial migration between environments (for example DEV to PROD stacks). As the restore will clone the exact source environment and source state, further configuration will be required (for example updating configuration options of the PROD stack EMR instance, S3 location etc.). But this method can cut down on some of the workload in cases where there are hundreds of tables to be configured and you are moving to a new EC2 instance.
BryteFlow supports high availability and auto recovery mechanisms in case of faults and failures.
Customers looking for high availability support are recommended to configure their BryteFlow Ingest instance for High availability and recovery. Details to setup this feature is mentioned in the High Availability / Recovery section.
Details to setup these features are mentioned in the Remote Monitoring section.
{ "Version": "2012-10-17", "Statement": [ { "Sid": "Stmt1", "Action": [ "logs:FilterLogEvents", "logs:GetLogEvents", "logs:PutLogEvents" ], "Effect": "Allow", "Resource": "arn:aws:logs:<region>:<account_ID>:log-group:<log-group-name>:log-stream:<log-stream-name>" }, { "Sid": "Stmt2", "Action": [ "sns:Publish", "sns:TagResource" ], "Effect": "Allow", "Resource": "arn:aws:sns:<region>:<account_ID>:<topic_name>" } ] }
BryteFlow does auto recovery of the instance and as it uses most durable services like S3 and Dynamo db to store its data, the data has unlimited retention.
In case of customer data, it totally depends on the Customer’s source db settings for data retention. If the source data is available BryteFlow Ingest can recover and replicate from thereon.
BryteFlow aims to meet the customer expectation of near real-time latency and therefore tries to recover automatically in most failure scenarios.
For EC2 failures, the RTO for BryteFlow applications is minimal (in the order of minutes) because the save-point of the Ingest application is maintained in near real-time in DynamoDB, a highly durable AWS service. When the Ingest instance comes back online after a restart, or after it was terminated abruptly (mostly in the case of EC2 failures), it resumes from the last successful save-point and continues replication from there, without needing a full reload.
For EMR failures, the RTO depends on the time taken to launch a new cluster; when using a single EMR cluster this varies from 10-30 minutes. Until the EMR cluster is up, Ingest retries with an exponential back-off mechanism until a successful connection is established, and replication then continues from the same point without any data loss.
Please note: no full reload is needed, unlike other available solutions.
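As an illustration of the retry behaviour described above (this is not BryteFlow's actual implementation), an exponential back-off loop looks roughly like the sketch below; connect_to_emr() and the retry limits are hypothetical placeholders.

# Illustrative sketch of an exponential back-off retry loop, similar in spirit to
# the behaviour described above. This is NOT BryteFlow's code; connect_to_emr()
# and the retry limits are hypothetical placeholders.
import time

def connect_with_backoff(connect_to_emr, max_attempts=10, base_delay=5, max_delay=300):
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect_to_emr()            # succeeds once the new cluster is reachable
        except ConnectionError:
            if attempt == max_attempts:
                raise                          # give up after the final attempt
            time.sleep(delay)                  # wait before the next attempt
            delay = min(delay * 2, max_delay)  # double the wait, capped at max_delay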
Once BryteFlow Ingest has recovered from a failure, follow the steps below to perform basic checks before restarting replication, in order to avoid any further errors or issues:
It is important for any organisation to maintain server and access logs for security and audit purposes. BryteFlow recommends enabling logging for the AWS services being used, such as Amazon S3 and Amazon EMR.
For enabling logs on S3, please refer to the AWS documentation below: https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerLogs.html#server-access-logging-overview
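Server access logging can also be enabled programmatically. Below is a minimal sketch using boto3; the bucket names are hypothetical, and the target bucket must already permit the S3 log delivery service to write to it.

# Minimal sketch: enable S3 server access logging on the bucket used by BryteFlow.
# Bucket names below are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_logging(
    Bucket="my-bryteflow-data-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "my-access-log-bucket",
            "TargetPrefix": "bryteflow-s3-access/",
        }
    },
)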
Amazon EMR produces logs by default; these are written to the master node. BryteFlow recommends launching the EMR cluster with logging enabled, which writes the logs to an S3 bucket. EMR is integrated with AWS CloudTrail: CloudTrail captures all API calls for Amazon EMR as events. All Amazon EMR actions are logged by CloudTrail and are documented in the Amazon EMR API Reference.
Customers can enable continuous delivery of CloudTrail events to an Amazon S3 bucket, including events for Amazon EMR. The information collected by CloudTrail helps determine the request that was made to Amazon EMR, the IP address from which the request was made, who made the request, when it was made, and additional details.
For more details on logging Amazon EMR API calls in AWS CloudTrail, refer to the AWS documentation below: https://docs.aws.amazon.com/emr/latest/ManagementGuide/logging_emr_api_calls.html
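To verify that EMR API calls are being captured, recent events can be queried from CloudTrail. Below is a minimal sketch using boto3.

# Minimal sketch: list recent Amazon EMR API calls captured by CloudTrail.
import boto3

cloudtrail = boto3.client("cloudtrail")
events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "elasticmapreduce.amazonaws.com"}
    ],
    MaxResults=20,
)
for event in events.get("Events", []):
    print(event["EventTime"], event["EventName"], event.get("Username", "-"))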
BryteFlow Ingest comes pre-configured with remote monitoring capabilities. These capabilities leverage existing AWS technology such as CloudWatch Logs and CloudWatch Events. CloudWatch can be used (in conjunction with other assets in the AWS ecosystem) to monitor the execution of BryteFlow Ingest and, in the event of errors or failures, raise the appropriate alarms. Events from the Ingest application flow to CloudWatch Logs and to Kinesis (if configured). These events provide detailed application statistics which can be used for any kind of custom monitoring.
In addition to the integration with CloudWatch and Kinesis, BryteFlow Ingest also writes the internal logs directly to S3 (BryteFlow Ingest console execution and error logs).
Prerequisites for enabling Remote Monitoring in BryteFlow Ingest are as follows:
Please note that the services below are optional; customers can choose to set up any, all or none of them.
To configure remote monitoring, perform the following steps:
The events that BryteFlow Ingest pushes to the AWS CloudWatch Logs and Metrics console are as follows; please refer to Appendix: Understanding Extraction Process for a more detailed breakdown.
Bryte Events | Description |
---|---|
LogfileProcessed | Archive log file processed (Oracle only) |
TableExtracted | Source table extract complete MS SQL Server and Oracle (initial extracts only) |
ExtractCompleted | Source extraction batch is complete |
TableLoaded | Destination table load is complete |
LoadCompleted | All destination table loads in a batch is complete |
HaltError | Unrecoverable error occurred and turned the Scheduler to OFF |
RetryError | Error occurred but will retry |
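In practice, the HaltError event can be wired to an alarm via a CloudWatch Logs metric filter. The sketch below is a minimal, illustrative example; the log group name, metric namespace and SNS topic ARN are assumptions and must be replaced with the values from your own setup.

# Minimal sketch: raise an alarm when BryteFlow Ingest emits a HaltError event.
# The log group, namespace and SNS topic ARN below are hypothetical placeholders.
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Count occurrences of "HaltError" in the Ingest log group.
logs.put_metric_filter(
    logGroupName="/bryteflow/ingest/your_ingest_name",
    filterName="bryteflow-halt-error",
    filterPattern='"HaltError"',
    metricTransformations=[{
        "metricName": "HaltErrorCount",
        "metricNamespace": "BryteFlow/Ingest",
        "metricValue": "1",
    }],
)

# Alarm (and notify via SNS) whenever at least one HaltError is seen in 5 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="bryteflow-ingest-halt-error",
    Namespace="BryteFlow/Ingest",
    MetricName="HaltErrorCount",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:<region>:<account_ID>:<topic_name>"],
)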
You can monitor the progress of your extracts and loads by navigating to the “Log” tab. The log shows the progress and current activity of the Ingest instance. Filters can be set to view specific log entries, such as errors.
BryteFlow Ingest stores the log files under your install folder, specifically under the \log folder.
The path to the log files is as follows: <install folder of Ingest>\log\sirus*.log, for example
c:\Bryte\Bryte_Ingest_37\log\sirus-2019-01.log
The error files are also stored under the \log folder.
The path to the error files is as follows: <install folder of Ingest>\log\error*.log, for example
c:\Bryte\Bryte_Ingest_37\log\error-2019-01.log
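To quickly review recent errors, the error logs can be read directly from the install folder. Below is a minimal sketch; the install path is the example location used above and should be adjusted to your own environment.

# Minimal sketch: print recent lines from the BryteFlow Ingest error logs.
# The path below is the example install location used in this guide; adjust it
# to your own <install folder of Ingest>\log directory.
from pathlib import Path

log_dir = Path(r"C:\Bryte\Bryte_Ingest_37\log")
for error_file in sorted(log_dir.glob("error*.log")):
    print(f"--- {error_file.name} ---")
    # Show the last 20 lines of each error log file.
    lines = error_file.read_text(errors="replace").splitlines()
    for line in lines[-20:]:
        print(line)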
These logs can also be reviewed/stored in S3; please refer to the Remote Monitoring section for details.
EMR Tagging
BryteFlow Ingest supports an EMR Tagging feature which can dramatically reduce the cost of EMR clusters. It allows customers to control EMR cost by terminating the cluster when not in use, without interrupting the Ingest configuration and schedule.
You can add the default tag ‘BryteflowIngest’ when creating a new Amazon EMR cluster for Ingest, or you can add, edit, or remove tags on a running Amazon EMR cluster. Then use the tag name and value in the EMR Configuration section of Ingest, as shown in the image below. EMR changeover is also handled gracefully: if an existing cluster's tag name is changed and a new cluster with the correct tag is created, any existing jobs on the old cluster will complete and new jobs are started on the new cluster.
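Tags can also be applied to a running cluster programmatically. Below is a minimal boto3 sketch; the cluster ID and tag value are hypothetical placeholders, and the tag key/value must match what is entered in the EMR Configuration section of Ingest.

# Minimal sketch: tag a running EMR cluster so BryteFlow Ingest can locate it by tag.
# The cluster ID and tag value are hypothetical placeholders.
import boto3

emr = boto3.client("emr")
emr.add_tags(
    ResourceId="j-XXXXXXXXXXXXX",          # ID of the running EMR cluster
    Tags=[{"Key": "BryteflowIngest", "Value": "prod"}],
)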
AWS allows customers to assign metadata to their AWS resources in the form of tags. It is also recommended that you tag all the AWS resources created for and by BryteFlow to support managing and organizing resources, access control, cost tracking, automation, and organization.
It is recommended to use tag names that are specific to the instance being created. For example, for a BryteFlow instance replicating a source database that is the production database server for Billing and Finances, the tag name should reflect the database it is dedicated to, such as ‘BryteFlowIngest_BFS_EC2_Prod’; similarly, for a UAT environment it could be ‘BryteFlowIngest_BFS_EC2_UAT’. By doing this, customers can easily differentiate between the various AWS resources used within their environment. Use similar tag names for each service.
BryteFlow recommends tagging the AWS services listed below with unique, identifiable tag names.
For a detailed guide on tagging resources in AWS, refer to the AWS documentation links provided:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-tags.html
https://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-tagging.html
Users already using the BryteFlow AMI Standard Edition can easily upgrade to the latest version of the software directly from the AWS Marketplace by following a few easy steps.
Steps to perform in your current install:
Steps to perform in your new install:
Enter
localhost:8081
into the Chrome browser to open the BryteFlow Ingest web console. Once upgraded, it is recommended to stop the AMI of the previous install:
BryteFlow’s licensing model is based on the data volume at the source being replicated across to the destination.
Volume-based licensing is classified into the groups below:
BryteFlow products are available from the AWS Marketplace for data volumes up to 1 TB. For source data volumes greater than 1 TB, it is recommended to contact BryteFlow support (email: support@bryteflow.com) for detailed information and inquiries.
BryteFlow products are available for use via the AWS Marketplace. They come in two different flavors:
BryteFlow Enterprise Edition-Data Integration S3, Redshift, Snowflake
Each of our products is backed by our responsive support team. Please allow up to 24 hours for us to get back to you. To get in touch with our support team, send an email to support@bryteflow.com.
BryteFlow provides Level 3 support to all its customers.
Tier 1 | Business hours |
Tier 2 | 24×7 Support |
Support Language | English (US & UK) |
Maintenance And Support Level | Description |
---|---|
Business Hours Support | Support for suspected bugs, errors or material differences between the use of Software and the specifications of Software outlined in the Documentation (Incidents). The scope of the Maintenance and Support Service is outlined with additional terms and conditions at Appendix A. |
Premium Support | 24×7 Support; Email support; Access to customer portal; Software updates; Escalation management for critical issues; Severity 1 issues – 1 hour; Severity 2 issues – 2 hours; Severity 3 issues – 4 hours |
Extraction has two parts to it.
An initial extract is done for the first time when we are connecting a database to BryteFlow Ingest software. In this extract, the entire table is replicated from the source database to the destination (AWS S3 or AWS Redshift).
A typical extraction goes through the following processes. The example below shows an extraction with MS SQL Server as the source and an Amazon S3 bucket as the destination.
Extracting 1 Full Extract database_name:table_name
Info(ME188): Stage pre-bcp
Info(ME190): Stage post-bcp
Info(ME260): Stage post-process
Extracted 1 Full Extract database_name:table_name complete (4 records)
Load file 1
Loading table emr_database:dbo.names with 4 records (220 bytes)
Transferring null to S3
Transferred null 10,890 bytes in 8s to S3
Transferring database_name_table_name to S3
After the initial extract, once the database has been replicated to the destination, we perform delta extracts. In a delta extract, only the changes on the source database are extracted and merged into the destination.
After the initial extraction is done, all further extracts are Delta Extracts (changes since the last extract).
A typical delta extract log file is shown below.
Extracting 2 Delta Extract database_name:table_name
Info(ME188): Stage pre-bcp
Info(ME190): Stage post-bcp
Info(ME260): Stage post-process
Delta Extract database_name complete (10 records)
Extracted 2
Load file 2
Loaded file 2
Keep all defaults. Click on Full Extract.
The first extract always has to be a Full Extract. This gets the entire table across, and then the deltas are populated periodically at the desired frequency.
To configure extracts to run at a specific time, perform the following steps.
After databases have been selected for extraction and they are replicating, a new table can be added to the extraction process by following these steps.
This will include the new table(s) in a full extract and also resume deltas for all the previously configured tables as well as the newly added table(s).
If the table transfer type is Primary Key with History, perform the following steps to resync all the data from the source:
BryteFlow Ingest supports connection to AWS CloudWatch Logs, CloudWatch Metrics and SNS. These can be used to monitor the operation of BryteFlow Ingest and to integrate with other assets leveraging the AWS infrastructure.
AWS CloudWatch Logs can be used to collect logs of events, such as load completion or failure, from BryteFlow Ingest. CloudWatch Logs can then be used to monitor error conditions and raise alarms.
Below is the list of events that BryteFlow Ingest pushes to the AWS CloudWatch Logs console and to AWS SNS:
Bryte Events | Description |
---|---|
LogfileProcessed | Archive log file processed (Oracle only) |
TableExtracted | Source table extract complete MS SQL Server and Oracle (initial extracts only) |
ExtractCompleted | Source extraction batch is complete |
TableLoaded | Destination table load is complete |
LoadCompleted | All destination table loads in a batch is complete |
HaltError | Unrecoverable error occurred and turned the Scheduler to OFF |
RetryError | Error occurred but will retry |
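To receive these events by email, an endpoint can be subscribed to the SNS topic that Ingest publishes to. Below is a minimal boto3 sketch; the topic ARN and email address are placeholders.

# Minimal sketch: subscribe an email address to the SNS topic that BryteFlow
# Ingest publishes its events to. Topic ARN and email address are placeholders.
import boto3

sns = boto3.client("sns")
sns.subscribe(
    TopicArn="arn:aws:sns:<region>:<account_ID>:<topic_name>",
    Protocol="email",
    Endpoint="ops-team@example.com",   # the confirmation email must be accepted
)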
Below are the details for each of the Bryte Events:
Event: LogfileProcessed
Attribute | Is Metric (Y/N)? | Description |
---|---|---|
type | N | “LogfileProcessed” |
generated | N | Timestamp of message |
source | N | Instance name |
sourceType | N | “CDC” |
fileSeq | N | File sequence |
file | N | File name |
dictLoadMS | Y | Time taken to load dictionary in ms |
CurrentDBDate | N | Current database date |
CurrentServerDate | N | Current Bryte server date |
parseMS | Y | Time taken to parse file in ms |
parseComplete | N | Timestamp when parsing is complete |
sourceDate | N | Source date |
Event: TableExtracted
Attribute | Is Metric (Y/N)? | Description |
---|---|---|
type | N | “TableExtracted” |
subType | N | Table name |
generated | N | Timestamp of message |
source | N | Instance name |
sourceType | N | “CDC” |
tabName | N | Table name |
success | N | true/false |
message | N | Status message |
sourceTS | N | Source date time |
sourceInserts | Y | No. of Inserts in source |
sourceUpdates | Y | No. of Updates in source |
sourceDeletes | Y | No. of Deletes in source |
Event: ExtractCompleted
Attribute | Is Metric (Y/N)? | Description |
---|---|---|
type | N | “ExtractCompleted” |
generated | N | Timestamp of message |
source | N | Instance name |
sourceType | N | “CDC” |
jobType | N | “EXTRACT” |
jobSubType | N | Extract type |
success | N | Y/N |
message | N | Status message |
runId | N | Run Id |
sourceDate | N | Source date |
dbDate | N | Current database date |
fromSeq | N | Start file sequence |
toSeq | N | End file sequence |
extractId | N | Run id for extract |
tableErrors | Y | Count of table errors |
tableTotals | Y | Count of total tables |
Event: TableLoaded
Attribute | Is Metric (Y/N)? | Description |
---|---|---|
type | N | “TableLoaded” |
subType | N | Table name |
generated | N | Timestamp of message |
source | N | Instance name |
sourceType | N | “CDC” |
tabName | N | Table name |
success | N | true/false |
message | N | Status message |
sourceTS | N | Source date time |
sourceInserts | Y | No. of Inserts in source |
sourceUpdates | Y | No. of Updates in source |
sourceDeletes | Y | No. of Deletes in source |
destInserts | Y | No. of Inserts in destination |
destUpdates | Y | No. of Updates in destination |
destDeletes | Y | No. of Deletes in destination |
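As an example of consuming these events, the snippet below parses a TableLoaded payload built only from the attributes listed above (the sample values are illustrative, not captured from a real run) and flags any mismatch between source and destination change counts.

# Illustrative only: parse a TableLoaded event using the attributes documented above
# and flag when destination change counts do not match the source counts.
import json

sample_event = json.dumps({           # illustrative payload, not captured from a real run
    "type": "TableLoaded",
    "subType": "dbo.names",
    "source": "Your_Ingest_Name",
    "sourceType": "CDC",
    "success": True,
    "sourceInserts": 4, "sourceUpdates": 0, "sourceDeletes": 0,
    "destInserts": 4, "destUpdates": 0, "destDeletes": 0,
})

event = json.loads(sample_event)
source_total = event["sourceInserts"] + event["sourceUpdates"] + event["sourceDeletes"]
dest_total = event["destInserts"] + event["destUpdates"] + event["destDeletes"]
if not event["success"] or source_total != dest_total:
    print(f"Check table {event['subType']}: source={source_total}, destination={dest_total}")
else:
    print(f"Table {event['subType']} loaded OK ({dest_total} changes applied)")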
Event: LoadCompleted
Attribute | Is Metric (Y/N)? | Description |
---|---|---|
type | N | “LoadCompleted” |
generated | N | Timestamp of message |
source | N | Instance name |
sourceType | N | “CDC” |
jobType | N | “LOAD” |
jobSubType | N | Sub type of the “LOAD” |
success | N | Y/N |
message | N | Status message |
runId | N | Run Id |
sourceDate | N | Source date |
dbDate | N | Current database date |
fromSeq | N | Start file sequence |
toSeq | N | End file sequence |
extractId | N | Run id for extract |
tableErrors | Y | Count of table errors |
tableTotals | Y | Count of total tables |
Event: HaltError
Attribute | Is Metric (Y/N)? | Description |
---|---|---|
type | N | “HaltError” |
generated | N | Timestamp of message |
source | N | Instance name |
sourceType | N | “CDC” |
message | N | Error message |
errorId | N | Short identifier |
Event: RetryError
Attribute | Is Metric (Y/N)? | Description |
---|---|---|
type | N | “RetryError” |
generated | N | Timestamp of message |
source | N | Instance name |
sourceType | N | “CDC” |
message | N | Error message |
errorId | N | Short identifier |
Release details (by date descending, latest version first)
Release Notes BryteFlow Ingest – v3.11.4
Released February 2022
New Features
Known Issues
1. Sync Struct is not supported for the S3/EMR destination with CSV (Bzip2, Gzip, None) output format;
it is only supported for Parquet (Snappy) and ORC (Snappy).
2. Athena table creation is supported only for Parquet (Snappy) compression from S3/EMR.
3. Source database type JDBC: Full Extract does not work for all databases.
Release Notes BryteFlow Ingest – v3.11
Released April 2021
New Features
1. New UI on the DATA page with tree and list views and filters for tables.
2. Support for Oracle RAC source.
3. Timestamp changes with Daylight Savings.
4. Fix for tables with special characters in the table name, such as slash and underscore.
5. Fixes related to Type-Change tables.
6. Fixes for Postgres source.
Known Issues
1. Sync Struct is not supported for the S3/EMR destination with CSV (Bzip2, Gzip, None) output format;
it is only supported for Parquet (Snappy) and ORC (Snappy).
2. Athena table creation is supported only for Parquet (Snappy) compression from S3/EMR.
3. Source database type JDBC: Full Extract does not work for all databases.
Release Notes BryteFlow Ingest – v3.10.1
Released October 2020
New Features
Bug Fixes
Limitations:
Release Notes BryteFlow Ingest – v3.10
Released June 2020
New Features
Release Notes BryteFlow Ingest – v3.10
Released May 2020
New Features
Bug Fixes
Known Issues
Release Notes BryteFlow Ingest – v3.9.3
Released March 2020
New Features
Bug Fixes
Release Notes BryteFlow Ingest – v3.9
Released December 2019
New Features
Bug Fixes
Known Issues
Release Notes BryteFlow Ingest – v3.8
Released November 2019
New Features
Bug Fixes
Known Issues
Release Notes BryteFlow Ingest – v3.7.3
Released April 2019
New Features
Bug Fixes
Known Issues
Release Notes BryteFlow Ingest – v3.7
Released: January 2019