[Data] Ecosystem of the Modern Data Infrastructure

Apr. 22, 2021

So yesterday, I came across this chart of the technologies that enterprises use to build a modern data infrastructure. Some of them are well known, such as AWS (S3, Lambda) or Google (BigQuery), but others I had not heard of. So I decided to look at each of them briefly in this post and go into more detail by category in subsequent articles.

The requirements for a product to be included in this chart (made by Indicative) are:

  1. Connect directly to the cloud data warehouse (or lake) as a single source of truth
  2. Be self-service or open-source
  3. Provide value in <= 3 hours
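
To make the three criteria concrete, here is a minimal sketch that expresses them as a single check. The `Product` fields and the sample entries (`tool_a`, `tool_b`, `tool_c`) are hypothetical and only for illustration; they are not taken from the chart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    # All field names are assumptions made for this sketch.
    name: str
    connects_to_warehouse_or_lake: bool
    self_service: bool
    open_source: bool
    hours_to_value: float


def qualifies(p: Product) -> bool:
    """Apply the three inclusion requirements listed above."""
    return (
        p.connects_to_warehouse_or_lake            # requirement 1
        and (p.self_service or p.open_source)      # requirement 2
        and p.hours_to_value <= 3                  # requirement 3
    )


# Hypothetical candidates, not real products.
candidates = [
    Product("tool_a", True, False, True, 2.0),    # passes all three
    Product("tool_b", True, False, False, 1.0),   # neither self-service nor open-source
    Product("tool_c", False, True, True, 0.5),    # does not connect to the warehouse
]

included = [p.name for p in candidates if qualifies(p)]
print(included)
```

Note that requirement 2 is an "or": a product only needs to be self-service or open-source, not both.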

Storage

Cloud data warehouse

  • Amazon Redshift: works with structured and semi-structured data using standard SQL. Results can be saved to S3 and further analysed with services like EMR, Athena, and SageMaker.
  • Google BigQuery: serverless, highly scalable multi-cloud data warehouse. Works with structured and semi-structured data using SQL. Queries streaming data in real time. Built-in BI engine. Built-in machine learning. Can export ML models to other platforms for prediction.
  • Azure Synapse Analytics: brings data integration, enterprise data warehousing, and big data analytics together in one service. Freedom to use serverless or dedicated resources.
  • Yellowbrick: a modern, MPP analytic database. ANSI SQL and PostgreSQL compatible.
  • Snowflake: unlimited number of concurrent users and queries. Works with structured and semi-structured data (such as JSON and XML) using SQL.

Data lakes

  • Amazon S3: Simple Storage Service is an object storage service. It can store any amount of data for a range of use cases, such as data lakes, websites, and enterprise applications.
  • Google Cloud Storage: object storage, multiple redundancy options, easy data transfer.
  • Delta Lake: an open-source project that enables building a lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS
  • Azure Data Lake Storage: limitless scale, single storage platform for ingestion, processing, and visualization

Data Ingestion

ETL/ELT (Databases and 3rd-Party Systems)

  • AWS Glue: a serverless data integration service, provides both visual and code-based interfaces
  • Fivetran: support for modern cloud warehouses, robust, automated pipelines
  • Airbyte: open-source data integration platform that syncs data from applications, APIs, and databases to warehouses, lakes, and other destinations
  • Google Cloud Data Fusion: visual point-and-click interface, code-free deployment of ETL/ELT data pipelines
  • Xplenty: implements ETL and ELT through a graphical interface, aimed at everyone regardless of technical experience
  • Matillion: comes with an extensive list of pre-built data source connectors, custom connectors to any REST API source system
  • Meltano: pipelines are code, ready to be version controlled, containerized, and deployed continuously. Built on the Singer standard. Transformation as a first-class citizen.
  • Stitch: moves data from 130+ data sources into a data warehouse, no coding required
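
The tools above differ in interface, but the ELT pattern they automate can be sketched in a few lines. This is a minimal illustration, not how any of these products work internally: an in-memory SQLite database stands in for the cloud warehouse, and the source records, table names (`raw_orders`, `customer_totals`), and function names are all hypothetical.

```python
import sqlite3

# Hypothetical source records, standing in for an API or operational database.
source_rows = [
    {"order_id": 1, "customer": "alice", "amount_cents": 1250},
    {"order_id": 2, "customer": "bob", "amount_cents": 300},
    {"order_id": 3, "customer": "alice", "amount_cents": 700},
]


def extract():
    """Extract: pull the rows from the source unchanged."""
    return list(source_rows)


def load(conn, rows):
    """Load: write the raw rows into a staging table in the warehouse."""
    conn.execute(
        "CREATE TABLE raw_orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :customer, :amount_cents)",
        rows,
    )


def transform(conn):
    """Transform: build a derived table with SQL, inside the warehouse."""
    conn.execute(
        "CREATE TABLE customer_totals AS "
        "SELECT customer, SUM(amount_cents) AS total_cents "
        "FROM raw_orders GROUP BY customer"
    )


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(conn, extract())
transform(conn)

totals = conn.execute(
    "SELECT customer, total_cents FROM customer_totals ORDER BY customer"
).fetchall()
print(totals)
```

In ETL the transform step would run before `load`, outside the warehouse; in ELT, as here, the raw data lands first and the transformation is expressed in SQL afterwards.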

Event collection (Behavioural & Event data)

  • mParticle: integrates all of your data and orchestrates it across channels, partners, and systems
  • Snowplow: lets teams collect, structure, process, and model high-quality data
  • RudderStack: open-source, warehouse-first customer data pipeline
  • Segment: collects events from web and mobile apps
  • Treasure Data: enterprise customer data platform that offers an end-to-end, fully managed cloud service for big data
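
As a rough picture of what event collection involves, the sketch below records behavioural events and ships them in batches. The `EventCollector` class, its `track` and `flush` methods, and the event fields are assumptions made for illustration; they do not reproduce the SDK of any tool listed above.

```python
import json
from datetime import datetime, timezone


class EventCollector:
    """Hypothetical collector: buffers events and flushes them in batches."""

    def __init__(self, batch_size=2):
        self.batch_size = batch_size
        self.buffer = []
        self.flushed = []  # stands in for a queue or warehouse destination

    def track(self, user_id, event, properties=None):
        """Record one event and flush once the buffer reaches batch_size."""
        self.buffer.append({
            "user_id": user_id,
            "event": event,
            "properties": properties or {},
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Serialize the buffered events as one JSON batch and clear the buffer."""
        if self.buffer:
            self.flushed.append(json.dumps(self.buffer))
            self.buffer = []


collector = EventCollector(batch_size=2)
collector.track("u1", "page_viewed", {"path": "/pricing"})
collector.track("u1", "signup_clicked")
collector.track("u2", "page_viewed", {"path": "/"})
```

After these three calls, the first two events have been sent as one batch and the third is still waiting in the buffer. Batching like this is a common design choice because it reduces the number of writes to the destination.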

Utilities

Orchestration

  • Talend
  • Dagster
  • Airflow
  • Astronomer

Transformation

  • dbt
  • Dataform

Data quality monitoring

  • Amundsen
  • Datakin
  • Marquez
  • Great Expectations

Data Catalog/Governance

  • Alation
  • Alex
  • ASG
  • Collibra
  • Congruity
  • Immuta
  • Infogix
  • OvalEdge
  • Smartlogic
  • Octopai
  • truedat

Operational Analytics

Reverse ETL

  • Hightouch
  • Census
  • Polytomic

Applied Analytics

Product Analytics

  • Indicative
  • Rakam

BI / Visualization

  • Sisense
  • Qlik
  • Metabase
  • Power BI
  • Plotly
  • Domo
  • Google Data Studio
  • Mode
  • Tableau
  • PopSQL
  • Looker
  • Zepl
  • Zoomdata