Published: September 6, 2021
Apache Parquet is a popular columnar storage file format used by Hadoop ecosystem tools such as Pig, Spark, and Hive. The format is language independent and binary, and Parquet files conventionally use the .parquet extension. It is designed to store large data sets efficiently. This blog post looks at how Parquet works and the tricks it uses to store data efficiently.

Key features of Parquet are:

- it’s cross platform
- it’s a recognised file format used by many systems
- it stores data in a column layout
- it stores metadata

The latter two points allow for efficient storage and querying of data.
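To make the column layout and metadata points a little more concrete, here is a minimal pure-Python sketch. It is a toy model, not the Parquet format or any Parquet library: the sample rows, the chunk size, and the helper names (`to_columns`, `chunk_with_stats`, `scan_greater_than`) are all made up for illustration. It only shows the general idea that storing values column by column, with min/max metadata per chunk, lets a query skip data it cannot possibly need.

```python
# Toy illustration only: this is NOT the Parquet format or its API.
# All names and data below are hypothetical.

rows = [
    {"id": 1, "city": "Leeds", "temp": 14},
    {"id": 2, "city": "Leeds", "temp": 15},
    {"id": 3, "city": "York", "temp": 11},
    {"id": 4, "city": "York", "temp": 12},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def chunk_with_stats(values, chunk_size):
    """Split a column into chunks and record min/max metadata for each."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        chunks.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return chunks


def scan_greater_than(chunks, threshold):
    """Return values above threshold, skipping chunks the metadata rules out."""
    matches = []
    chunks_read = 0
    for chunk in chunks:
        if chunk["max"] <= threshold:
            continue  # the metadata alone shows nothing in this chunk matches
        chunks_read += 1
        matches.extend(v for v in chunk["values"] if v > threshold)
    return matches, chunks_read


columns = to_columns(rows)
temp_chunks = chunk_with_stats(columns["temp"], chunk_size=2)
matches, chunks_read = scan_greater_than(temp_chunks, threshold=13)
print(matches, chunks_read)
```

In this sketch a query on `temp` never touches the `id` or `city` values, and the second chunk of `temp` is skipped purely on its recorded maximum. Real Parquet files are considerably more involved, but the same two ingredients are at work.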