[Data] Ecosystem of the Modern Data Infrastructure
Apr. 22, 2021
Yesterday I came across this chart of the technologies enterprises use to build a modern data infrastructure. Some are popular, such as AWS (S3, Lambda) or Google (BigQuery), but others I had not heard of. So I decided to look into each of them briefly in this post and go into more detail by category in subsequent articles.
The requirements for a product to be included in this chart (made by Indicative) are:
- Connect directly to the cloud data warehouse (or lake) as a single source of truth
- Be self-service or open-source
- Provide value in <= 3 hours
Storage
Cloud data warehouse
- Amazon Redshift: works with structured and semi-structured data using standard SQL. Results can be saved to S3 and further analysed with services like EMR, Athena, and SageMaker.
- Google BigQuery: serverless, highly scalable multi-cloud data warehouse. Works with structured and semi-structured data using SQL. Can query streaming data in real time. Built-in BI engine. Built-in machine learning. Can export ML models to other platforms for prediction.
- Azure Synapse Analytics: data integration, enterprise data warehousing, and big data analytics all together. Freedom to use serverless or dedicated resources.
- Yellowbrick: a modern, MPP analytic database. ANSI SQL and PostgreSQL compatible.
- Snowflake: marketed as supporting an unlimited number of concurrent users and queries. Works with structured and semi-structured data (such as JSON and XML) using SQL.
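Several of the warehouses above advertise querying semi-structured data with SQL. To get a feel for what that means, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for a warehouse. The table, the JSON records, and the column names are all made up for illustration, and it assumes your SQLite build includes the JSON functions (most recent ones do). Each real warehouse has its own syntax for nested fields.

```python
import json
import sqlite3

# In-memory SQLite stands in for a cloud warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

# Hypothetical semi-structured records, stored as JSON text.
records = [
    {"user": "ana", "device": {"os": "ios"}, "amount": 12.5},
    {"user": "ben", "device": {"os": "android"}, "amount": 7.0},
    {"user": "ana", "device": {"os": "ios"}, "amount": 3.5},
]
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(json.dumps(r),) for r in records],
)

# Plain SQL reaches into the nested JSON fields and aggregates them.
query = """
    SELECT json_extract(payload, '$.user')      AS user_name,
           json_extract(payload, '$.device.os') AS device_os,
           SUM(json_extract(payload, '$.amount')) AS total
    FROM events
    GROUP BY user_name, device_os
    ORDER BY user_name
"""
totals = conn.execute(query).fetchall()
for row in totals:
    print(row)
```

The point is only that nested fields become addressable from SQL without first flattening the data into columns.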
Data lakes
- Amazon S3: Simple Storage Service is an object storage service. It can store any amount of data for a range of use cases, such as data lakes, websites, and enterprise applications.
- Google Cloud Storage: object storage, multiple redundancy options, easy data transfer.
- Delta Lake: an open-source project that enables building a lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS
- Azure Data Lake Storage: limitless scale, single storage platform for ingestion, processing, and visualization
Data Ingestion
ETL/ELT (Databases and 3rd Party System)
- AWS Glue: a serverless data integration service that provides both visual and code-based interfaces
- Fivetran: support for modern cloud warehouses, robust, automated pipelines
- Airbyte: open-source data integration platform, the new standard to sync data from applications, APIs, & databases to warehouses, lakes & other destinations
- Google Cloud Data Fusion: visual point-and-click interface, code-free deployment of ETL/ELT data pipelines
- Xplenty: implement ETL and ELT using a graphical interface, for everyone regardless of tech experience
- Matillion: comes with an extensive list of pre-built data source connectors, custom connectors to any REST API source system
- Meltano: pipelines are code, ready to be version controlled, containerized, and deployed continuously. Built on the Singer standard. Transformation as a first-class citizen.
- Stitch: moves data from 130+ data sources into a data warehouse, no coding required
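All of the tools above automate some variation of the same pattern, so here is a rough sketch of ELT (extract, load, then transform inside the warehouse). It is not how any of these products work internally. SQLite plays the warehouse, and the source rows, the table names, and the `extract`/`load`/`transform` helpers are hypothetical.

```python
import sqlite3


def extract():
    """Pretend source system: returns raw order rows (hypothetical data)."""
    return [
        ("o1", "2021-04-20", "1999"),
        ("o2", "2021-04-20", "500"),
        ("o3", "2021-04-21", "4210"),
    ]


def load(conn, rows):
    """Load: copy source rows into a raw table without reshaping them."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, order_date TEXT, amount_cents TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform(conn):
    """Transform: build a typed, aggregated table with SQL inside the warehouse."""
    conn.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT order_date,
               SUM(CAST(amount_cents AS INTEGER)) AS revenue_cents
        FROM raw_orders
        GROUP BY order_date
        """
    )


conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse
load(conn, extract())
transform(conn)
daily = conn.execute(
    "SELECT order_date, revenue_cents FROM daily_revenue ORDER BY order_date"
).fetchall()
print(daily)
```

In classic ETL the cast and aggregation would happen before loading. In ELT the raw rows land first and the warehouse does the reshaping, which is why the transformation tools further down operate on SQL.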
Event collection (Behavioural & Event data)
- mParticle: integrates all of your data and orchestrates it across channels, partners, and systems
- Snowplow: lets teams collect, structure, process, and model high-quality data
- RudderStack: open-source, warehouse-first customer data pipeline
- Segment: collects events from web and mobile apps
- Treasure Data: enterprise customer data platform that offers an end-to-end, fully managed cloud service for big data
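To make "event data" a bit more concrete, below is a small sketch of what a behavioural event record could look like before it is shipped to a warehouse. The field names, the `build_event`, `is_valid`, and `to_json_line` helpers, and the sample event are my own assumptions for illustration. They do not follow any particular vendor's schema or SDK.

```python
import json
from datetime import datetime, timezone

# Hypothetical minimal schema: every event needs these fields.
REQUIRED_FIELDS = ("event", "user_id", "timestamp")


def build_event(event, user_id, properties=None, timestamp=None):
    """Assemble one event record with a UTC ISO 8601 timestamp."""
    ts = timestamp or datetime.now(timezone.utc)
    return {
        "event": event,
        "user_id": user_id,
        "timestamp": ts.isoformat(),
        "properties": dict(properties or {}),
    }


def is_valid(record):
    """Check that all required fields are present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)


def to_json_line(record):
    """Serialize the record as one JSON line, ready to append to a log or queue."""
    return json.dumps(record, sort_keys=True)


fixed_time = datetime(2021, 4, 22, 9, 30, tzinfo=timezone.utc)
evt = build_event("checkout_completed", "user-42", {"plan": "pro"}, fixed_time)
line = to_json_line(evt)
print(line)
```

Real collectors add much more (anonymous IDs, context about the device or page, schema versions), but the core idea of a named event plus a bag of properties is the same.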
Utilities
Orchestration
- Talend
- Dagster
- Airflow
- Astronomer
Transformation
- dbt
- Dataform
Data quality monitoring
- Amundsen
- Datakin
- Marquez
- Great Expectations
Data Catalog/Governance
- Alation
- Alex
- ASG
- Collibra
- Congruity
- Immuta
- Infogix
- OvalEdge
- Smartlogic
- Octopai
- Truedat
Operational Analytics
Reverse ETL
- Hightouch
- Census
- Polytomic
Applied Analytics
Product Analytics
- Indicative
- Rakam
BI / Visualization
- Sisense
- Qlik
- Metabase
- Power BI
- Plotly
- Domo
- GoogleData
- Mode
- Tableau
- PopSQL
- Looker
- Zepl
- Zoomdata