[Data] Ecosystem of the Modern Data Infrastructure

Apr. 22, 2021

Yesterday I came across a chart of the technologies enterprises use to build a modern data infrastructure. Some are popular, such as AWS (S3, Lambda) and Google (BigQuery), but others I had not heard of. So I decided to look briefly at each of them in this post and go into more detail by category in subsequent articles.

The requirements for a product to be included in this chart (made by Indicative) are:

Connect directly to the cloud data warehouse (or lake) as a single source of truth
Be self-service or open-source
Provide value in <= 3 hours

Storage

Cloud data warehouse

Amazon Redshift: can work with structured and semi-structured data using standard SQL. Results can be saved to S3 and further analysed with services like EMR, Athena, and SageMaker.
Google BigQuery: serverless, highly scalable multi-cloud data warehouse. Can work with structured and semi-structured data using SQL. Can query streaming data in real time. Built-in BI engine. Built-in machine learning. Can export ML models to other platforms for prediction.
Azure Synapse Analytics: brings together data integration, enterprise data warehousing, and big data analytics. Freedom to use serverless or dedicated resources.
Yellowbrick: a modern, MPP analytic database. ANSI SQL and PostgreSQL compatible.
Snowflake: unlimited number of concurrent users and queries. Works with structured and semi-structured data (JSON and XML) using SQL.
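
Several of the warehouses above advertise SQL over both structured and semi-structured data. As a rough illustration of what that means, here is a minimal sketch that uses Python's built-in SQLite as a stand-in for a warehouse. The table, the columns, and the `json_get` helper are all made up for this example; real warehouses ship their own JSON functions and syntax.

```python
import json
import sqlite3

# SQLite stands in for a cloud warehouse; table and column names are invented.
conn = sqlite3.connect(":memory:")


def json_get(doc, key):
    """Hypothetical helper: return a top-level field from a JSON string."""
    return json.loads(doc).get(key)


# Register the helper so it can be called from inside SQL.
conn.create_function("json_get", 2, json_get)

conn.execute("CREATE TABLE events (user_id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, json.dumps({"action": "click", "ms": 120})),
        (1, json.dumps({"action": "view", "ms": 300})),
        (2, json.dumps({"action": "click", "ms": 80})),
    ],
)

# One query mixes a structured column (user_id)
# with a field pulled out of the semi-structured payload (action).
rows = conn.execute(
    """
    SELECT user_id, COUNT(*) AS clicks
    FROM events
    WHERE json_get(payload, 'action') = 'click'
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()
print(rows)
```

The point is only that the JSON payload is stored as-is and picked apart at query time, next to ordinary typed columns.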

Data lakes

Amazon S3: Simple Storage Service is an object storage service. It can store any amount of data for a range of use cases, such as data lakes, websites, and enterprise applications.
Google Cloud Storage: object storage, multiple redundancy options, easy data transfer.
Delta Lake: an open-source project that enables building a lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS
Azure Data Lake Storage: limitless scale, single storage platform for ingestion, processing, and visualization

Data Ingestion

ETL/ELT (Databases and 3rd Party System)

AWS Glue: a serverless data integration service, provides both visual and code-based interfaces
Fivetran: support for modern cloud warehouses, robust, automated pipelines
Airbyte: open-source data integration platform, the new standard to sync data from applications, APIs, & databases to warehouses, lakes & other destinations
Google Cloud Data Fusion: visual point-and-click interface, code-free deployment of ETL/ELT data pipelines
Xplenty: implement ETL and ELT through a graphical interface, for everyone regardless of tech experience
Matillion: comes with an extensive list of pre-built data source connectors, custom connectors to any REST API source system
Meltano: pipelines are code, ready to be version controlled, containerized, and deployed continuously. Built on the Singer standard. Transformation as a first-class citizen.
Stitch: moves data from 130+ data sources into a data warehouse, no coding required
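
To make the Singer standard mentioned under Meltano a bit more concrete: a Singer "tap" writes JSON messages of type SCHEMA, RECORD, and STATE, one per line, and a "target" reads them. The sketch below imitates that flow in memory. The `tap_users` and `target_memory` functions, the `users` stream, and the `users_last_id` state key are hypothetical names for illustration, not part of any real tap or target.

```python
import json


def tap_users(rows):
    """Yield Singer-style messages (one JSON document per line) for a 'users' stream."""
    yield json.dumps({
        "type": "SCHEMA",
        "stream": "users",
        "schema": {
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            }
        },
        "key_properties": ["id"],
    })
    for row in rows:
        yield json.dumps({"type": "RECORD", "stream": "users", "record": row})
    # STATE lets a later run resume from where this one stopped.
    last_id = rows[-1]["id"] if rows else None
    yield json.dumps({"type": "STATE", "value": {"users_last_id": last_id}})


def target_memory(lines):
    """Toy target: collect records per stream and remember the last state seen."""
    tables, state = {}, None
    for line in lines:
        msg = json.loads(line)
        if msg["type"] == "RECORD":
            tables.setdefault(msg["stream"], []).append(msg["record"])
        elif msg["type"] == "STATE":
            state = msg["value"]
    return tables, state


source_rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Linus"}]
tables, state = target_memory(tap_users(source_rows))
print(tables, state)
```

In a real setup the tap and the target are separate programs connected by a pipe, which is what lets any extractor be paired with any loader.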

Event collection (Behavioural & Event data)

mParticle: integrates all of your data and orchestrates it across channels, partners, and systems
Snowplow: lets teams collect, structure, process, and model high-quality data
RudderStack: open-source, warehouse-first customer data pipeline
Segment: collects events from web and mobile apps
Treasure Data: enterprise customer data platform that offers an end-to-end, fully-managed cloud service for big data
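
What these event collection tools have in common is that they take raw events from apps and turn them into structured records before they reach the warehouse. A minimal sketch of that step might look like the following. The `Event` fields and the `collect` function are assumptions made up for this example and do not follow any vendor's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Event:
    user_id: str
    name: str
    properties: dict = field(default_factory=dict)
    timestamp: str = ""


def collect(raw):
    """Validate a raw dict and return a structured Event, or None if it is unusable."""
    if not raw.get("user_id") or not raw.get("name"):
        return None
    # Fall back to the collection time when the client sent no timestamp.
    ts = raw.get("timestamp") or datetime.now(timezone.utc).isoformat()
    return Event(str(raw["user_id"]), raw["name"], dict(raw.get("properties", {})), ts)


raw_events = [
    {"user_id": 42, "name": "signup", "timestamp": "2021-04-22T10:00:00+00:00"},
    {"name": "orphan event"},  # no user_id, so it is dropped
    {
        "user_id": 42,
        "name": "purchase",
        "properties": {"amount": 9.99},
        "timestamp": "2021-04-22T10:05:00+00:00",
    },
]

events = [e for e in map(collect, raw_events) if e is not None]
print([e.name for e in events])
```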

Utilities

Orchestration

Talend
Dagster
Airflow
Astronomer

Transformation

dbt
Dataform

Data quality monitoring

Amundsen
Datakin
Marquez
Great Expectations

Data Catalog/Governance

Alation
Alex
ASG
Collibra
Congruity
Immuta
Infogix
OvalEdge
Smartlogic
Octopai
Truedat

Operational Analytics

Reverse ETL

Hightouch
Census
Polytomic

Applied Analytics

Product Analytics

Indicative
Rakam

BI / Visualization

Sisense
Qlik
Metabase
Power BI
Plotly
Domo
Google Data Studio
Mode
Tableau
PopSQL
Looker
Zepl
Zoomdata