- Type: Columnar storage format for big data, commonly used within the Hadoop ecosystem.
- Structure:
  - Stores data in columns rather than rows, enabling efficient compression and querying.
  - Includes metadata describing the schema and structure of the data.
Strengths
- Compression: Highly compressible, reducing storage footprints and associated costs.
- Optimized for Analytics: Columnar storage allows you to read only the columns relevant to a query, minimizing I/O and boosting query performance.
- Self-Describing: Includes metadata, making it easier to work with data without requiring external schema definitions.
- Splittable: Parquet files can be split and processed in parallel, benefiting distributed systems.
- Wide Ecosystem Support: Compatible with Apache Spark, Hadoop, Presto, and many other big data tools.
Weaknesses
- Writes Can Be Slower: Because data must be buffered and encoded column by column, writing Parquet can be slower than row-based formats in write-heavy scenarios.
- Not Ideal for Frequent Updates: Parquet is better suited to write-once, read-many workloads; in-place updates and deletes are complex and typically require rewriting files.
- Small Files Overhead: Storing many small Parquet files can introduce metadata management overhead.
Usable Data Formats
Parquet is designed to store structured data and works well with:
- Tabular Data: Data with well-defined columns and types (e.g., CSV-like data).
- Semi-structured Data: Formats such as JSON or Avro, provided the data has a defined schema.