Contents

Data Schema
Structured Data
Semi-Structured Data

Data Schema

Data schemas are object models used to translate inbound and representative data into logical concepts of known events. They are mainly used to properly classify and model existing data within a data warehouse, though in a modern cloud-native data stack this can include the data lake as well. It is important that schemas represent real objects or events that are philosophically sound and logically consistent: if your application logic or data capture procedures are flawed, the data will be flawed as well.

Event inflows and registrations should have proper schemas attached in the form of a conceptualized data model. If schemas are unavailable, or the data are not properly typed or are divergent (for example, unstructured NoSQL event data), it may be necessary to use parsing procedures that are more adaptable to unstructured or dirty data.

For data engineering purposes we are usually focused on the following schema types:

  • Structured Data - flat, two-dimensional tabular data
  • Semi-Structured Data - dictionary-oriented data such as JSON (key:value data, including derivatives such as Avro)
Structured Data

A structured schema consists of two-dimensional data arranged in a series of rows and columns and is commonly referred to as tabular data. This is a very common architecture, seen in engineering projects, data transmissions, spreadsheet software, and elsewhere. It is also the most common data warehousing architecture.

Structured data is often represented within data lakes as delimited files, usually CSVs. Delimited files are universally readable and workable within all cloud data warehouses and are well suited to most workloads. Structured data can be parsed easily within a cloud data warehouse, as the sketch below shows.
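For example, here is a minimal sketch of loading a headered CSV from a data lake into Redshift with the COPY command. The table name, bucket path, and IAM role are hypothetical placeholders:

    -- Load a headered CSV from S3 into an existing Redshift table.
    -- 'analytics.orders', the bucket path, and the IAM role are placeholders.
    COPY analytics.orders
    FROM 's3://example-bucket/exports/orders.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load'
    FORMAT AS CSV
    IGNOREHEADER 1;

The equivalent load in BigQuery or another cloud warehouse follows the same pattern: point the warehouse at the delimited file and let it map the columns onto the table.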

Semi-Structured Data

Multidimensional tables are usually abstract representations of complex dictionary data. These tables are typically built on top of JSON or Avro data and are abstracted within Redshift as the SUPER data type or, in BigQuery, instantiated as nested and repeated fields.
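As a rough illustration, the same event payload might be declared with a SUPER column in Redshift or with nested and repeated fields in BigQuery. The table and field names here are made up for the sketch:

    -- Redshift: the payload is stored as a single semi-structured SUPER column.
    CREATE TABLE events_raw (
        event_id  BIGINT,
        payload   SUPER
    );

    -- BigQuery: the same shape expressed as nested (STRUCT)
    -- and repeated (ARRAY) fields.
    CREATE TABLE analytics.events_raw (
        event_id INT64,
        payload STRUCT<
            user_id INT64,
            items ARRAY<STRUCT<sku STRING, qty INT64>>
        >
    );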

There are some notable differences between Redshift's and BigQuery's approaches to handling semi-structured data. Redshift uses PartiQL to parse complex JSON or semi-structured data, whereas BigQuery can use native SQL. In addition, BigQuery has the JSON data type, which provides a sophisticated, SQL-native method for working with JSON data. An in-depth comparison of the two methods is beyond the scope of this guide, but we believe that BigQuery's approach is preferable because it can provide complex analytical and parsing structures without having to move between systems.
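To make the difference concrete, here is a hedged sketch of pulling a single field out of the payload in each system, continuing the hypothetical tables above (the BigQuery example assumes a JSON-typed column named payload_json):

    -- Redshift: PartiQL dot notation navigates into the SUPER column.
    SELECT e.payload.user_id
    FROM events_raw AS e;

    -- BigQuery: a JSON-typed column is read with native JSON functions.
    SELECT JSON_VALUE(payload_json, '$.user_id') AS user_id
    FROM analytics.events_raw_json;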

If the data you are importing is well structured and clean, you can use BigQuery's schema auto-detection. However, we've found that this method can be quite fragile: a single error in the schema detection process or within the data can result in import or query errors, rendering the schema useless.
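As a sketch, auto-detection can be exercised by creating an external table without an explicit column list; our understanding is that BigQuery infers the schema when none is supplied. The dataset and bucket names are hypothetical:

    -- BigQuery infers the column names and types from the file
    -- because no explicit schema is declared here.
    CREATE EXTERNAL TABLE analytics.orders_ext
    OPTIONS (
        format = 'CSV',
        uris = ['gs://example-bucket/exports/orders.csv'],
        skip_leading_rows = 1
    );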

We can use SQL-native JSON parsing to create materialized views or tables using native SQL in Redshift, BigQuery, or any other SQL data warehouse. This method is useful when working with data that may be messy and require data type translation and materialization.
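For instance, here is a minimal BigQuery sketch that materializes a clean, typed table from a raw JSON column. The source table and field names are assumptions carried over from the earlier examples:

    -- Materialize a typed table from a raw JSON column, casting as we go.
    -- SAFE_CAST returns NULL instead of failing on dirty values.
    CREATE OR REPLACE TABLE analytics.events_clean AS
    SELECT
        SAFE_CAST(JSON_VALUE(payload_json, '$.user_id') AS INT64)  AS user_id,
        SAFE_CAST(JSON_VALUE(payload_json, '$.ts') AS TIMESTAMP)   AS event_ts,
        JSON_VALUE(payload_json, '$.event_name')                   AS event_name
    FROM analytics.events_raw_json;

Using SAFE_CAST here is a deliberate choice for messy data: bad values become NULLs you can audit afterward rather than errors that abort the whole materialization.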