November 10, 2018

What is Schema-on-Read and how does it differ from Schema-on-Write?

Schema-on-Write vs Schema-on-Read

Databases have employed a Schema-on-Write paradigm for decades: the schema/structure is defined up front, and the data is then written to that schema as part of the write process. Only once the data has been written to the schema is it available for reading, hence the name Schema-on-Write.

Schema-on-Read has come about with the rise of big data. Here the data is first landed in its native form (structured and/or unstructured) with no imposed schema. The schema is applied only when the data is read, hence Schema-on-Read.
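
To make the contrast concrete, here is a minimal sketch in Python. It is illustrative only: SQLite stands in for a Schema-on-Write database, a plain JSON-lines file stands in for data landed in its native form, and the table, file and column names are assumptions.

```python
import json
import sqlite3

import pandas as pd

# Schema-on-Write: the structure is defined before any data is written,
# and every row must fit it at write time.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
con.execute("INSERT INTO orders VALUES (?, ?)", (1, 99.50))

# Schema-on-Read: the data lands as-is, extra or missing fields and all.
with open("orders.jsonl", "w") as f:
    f.write(json.dumps({"order_id": 1, "amount": 99.50, "channel": "web"}) + "\n")
    f.write(json.dumps({"order_id": 2, "amount": 20.00}) + "\n")

# The structure is imposed only now, at read time.
df = pd.read_json("orders.jsonl", lines=True)
print(df[["order_id", "amount"]])
```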

So, the big question is this: if Schema-on-Write has served as the de facto standard for decades, why would we want to change, and what are the advantages of Schema-on-Read?

No upfront schema to be defined when landing the data…

If we consider data as a shared asset within an organisation, it is used by many users, fulfilling many roles, for many purposes. Schema-on-Write requires upfront definition and understanding of all current and future use cases, because that understanding is needed to create a 'one size fits all' schema covering every use case. Typically, a 'one size fits all' approach can work for everyone but is a perfect fit for no one.

This also requires understanding the data upfront and then trying to accommodate future requirements, which may not always be entirely predictable.

Schemas created for specific use cases…

Schema-on-Read does not impose a structure when landing the data; it uses the source's native format or schema. It enables each user, role or purpose to define a schema that is specific to the use case, so the schema fits that use perfectly. It allows users to understand the data and derive value from it immediately, before modelling it for use cases that may benefit from a single model. As new or updated use cases are discovered, new schemas can be created to meet them.
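
As a sketch of use-case-specific schemas (the file, field and team names below are illustrative assumptions), the same raw JSON-lines data can be read with two different explicit schemas, one per use case, here using pyarrow:

```python
import json

import pyarrow as pa
from pyarrow import json as pa_json

# Land two raw records in their native JSON form (illustrative data).
with open("orders.jsonl", "w") as f:
    f.write(json.dumps({"order_id": 1, "amount": 99.50, "channel": "web"}) + "\n")
    f.write(json.dumps({"order_id": 2, "amount": 20.00}) + "\n")

# Use case 1: finance only needs order ids and amounts.
finance_schema = pa.schema([("order_id", pa.int64()), ("amount", pa.float64())])
finance = pa_json.read_json(
    "orders.jsonl",
    parse_options=pa_json.ParseOptions(
        explicit_schema=finance_schema, unexpected_field_behavior="ignore"
    ),
)

# Use case 2: marketing wants the acquisition channel instead.
marketing_schema = pa.schema([("order_id", pa.int64()), ("channel", pa.string())])
marketing = pa_json.read_json(
    "orders.jsonl",
    parse_options=pa_json.ParseOptions(
        explicit_schema=marketing_schema, unexpected_field_behavior="ignore"
    ),
)

print(finance.to_pydict())    # one copy of the data...
print(marketing.to_pydict())  # ...two read-time schemas
```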

Single storage of data…

With Schema-on-Read, data is stored in its native format and the schema is applied only when the data is read. This allows a single data set, stored only once, to serve many different use cases. Schema-on-Write would require multiple schemas to be defined, and the data would probably be stored multiple times (perhaps once for each use case).
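
On AWS this often takes the form of several external tables defined over the same Amazon S3 location: the data is stored once, and each table is only a read-time view. Below is a hedged sketch using boto3 and Amazon Athena; the bucket, database, table and column names are all assumptions.

```python
import boto3

athena = boto3.client("athena")

def run(sql: str) -> str:
    """Submit a query to Athena and return its execution id."""
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "raw_lake"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return resp["QueryExecutionId"]

# Two schemas over the same LOCATION: nothing is copied or re-stored.
run("""
CREATE EXTERNAL TABLE IF NOT EXISTS orders_finance (
  order_id bigint, amount double
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-data-lake/orders/'
""")

run("""
CREATE EXTERNAL TABLE IF NOT EXISTS orders_marketing (
  order_id bigint, channel string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-data-lake/orders/'
""")
```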

Agility of data…

One of the biggest benefits of Schema-on-Read is the agility of the data: data can be landed with minimal up-front effort and then consumed immediately.

Performance…

Schema-on-Read has had its doubters in the past, often because of performance concerns when compared with Schema-on-Write. However, this is no longer the issue it perhaps once was.

How BryteFlow helps to transition between both approaches…

The BryteFlow product allows quick ingestion of data by transferring only the changes in the data, or deltas, to Amazon S3 in the source format. Ingestion can be scheduled near real time, both for scale (for large volumes of data it is easier to synchronise frequently than to run one mega extract and load) and for speed (when data is required near real time for operational reporting). The data is automatically consolidated on Amazon S3, with inserts, updates and deletes applied, so it can be used as a replica of the source, whether that is SAP, Oracle, SQL Server or another source.
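
This is not BryteFlow's implementation, just a generic Python illustration of the consolidation idea: a batch of change records flagged as inserts, updates or deletes is merged into the current replica on a primary key. The 'op' flag, the key and the column names are assumptions.

```python
import pandas as pd

def apply_changes(replica: pd.DataFrame, changes: pd.DataFrame) -> pd.DataFrame:
    """Merge a batch of insert/update/delete rows into the replica, keyed on order_id."""
    # Remove rows that are deleted or about to be replaced by an update.
    touched = set(changes["order_id"])
    kept = replica[~replica["order_id"].isin(touched)]
    # Keep the non-delete changes as the new or updated rows.
    upserts = changes[changes["op"] != "D"].drop(columns=["op"])
    return pd.concat([kept, upserts], ignore_index=True)

replica = pd.DataFrame({"order_id": [1, 2], "amount": [99.5, 20.0]})
changes = pd.DataFrame({
    "op": ["U", "D", "I"],          # update order 1, delete order 2, insert order 3
    "order_id": [1, 2, 3],
    "amount": [105.0, None, 42.0],
})
print(apply_changes(replica, changes))  # orders 1 (updated) and 3 (inserted) remain
```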

This allows you to use the Schema-on-Read approach, so you don't sink all your time into preparing modelled data. Users can access data instantly on Amazon S3. This is also great for operational reporting, as it removes the burden of that access from the operational source, and it gives data scientists instant access to raw data for their analysis.

BryteFlow ensures that the data is extracted at a consistent, configurable point in time across the source, mitigating the temporal inconsistencies that can result from extracting data from a source system at different times.

As users become more familiar with the data, data models become important; they are a way of consolidating and governing business access across the organisation. At that point, BryteFlow allows you to use an intuitive GUI on your raw Amazon S3 data to model it and create data assets that can then be used across the organisation in a Schema-on-Write fashion.

And because Amazon S3 is scalable, low-cost storage, you can keep both approaches, potentially one for operational access and the other for business intelligence reporting and analytics.

By combining both methods, BryteFlow gives you agility and flexibility for reporting and data analytics on a scalable, cost-effective platform, which can be harder to achieve with other technologies.
