The right data format is essential to achieving optimal performance and the desired business outcomes. Analysts, data scientists, engineers, and business users need to understand these formats in order to make decisions and understand workflows. Choosing the correct file type matters: it can speed up reads and writes, improve tooling support, enable schema evolution, and allow advanced compression.
Before moving to the file formats, we need to understand:
What is row-wise and column-wise storage of data?
Once we store data in different file formats, files are read by a computer in one of two ways: row-wise or column-wise.
Let's check the data below:
Ammu,chennai,40000,kochu,kochi,50000,midhu,banglore,45000
To process this data, a computer reads it from left to right, starting at the first row and then reading each subsequent row. Storing data in this format is ideal when you need to access one or more entries and all or many columns. Column-based formats, by contrast, store data by column:
Ammu,kochu,midhu,chennai,kochi,banglore,40000,50000,45000
In columnar formats, data is stored sequentially by column, from top to bottom, rather than by row from left to right. Having data grouped by column makes it more efficient to focus computation on specific columns. Storing data sequentially by column also allows faster scans, because all relevant values sit next to each other; there is no need to search for values within the rows.
At the highest level, column-based storage is most useful when performing analytics queries that require only a subset of columns examined over very large data sets. If your queries require access to all or most of the columns of each row of data, row-based storage is better suited to your needs.
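As a small illustration of the difference, here is the same sample data laid out both ways in plain Python; this is only a conceptual sketch, not how any particular engine stores data internally:

# Row-wise layout: one record after another, all columns kept together
rows = [
    ("Ammu", "chennai", 40000),
    ("kochu", "kochi", 50000),
    ("midhu", "banglore", 45000),
]

# Column-wise layout: all values of each column stored together
columns = {
    "name": ["Ammu", "kochu", "midhu"],
    "city": ["chennai", "kochi", "banglore"],
    "salary": [40000, 50000, 45000],
}

# A query that only needs salaries touches one list in the columnar layout,
# but has to walk through every record in the row-wise layout.
print(sum(columns["salary"]))
print(sum(salary for _, _, salary in rows))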
Below are different file formats in Big Data
CSV
JSON
Parquet
Avro
ORC
Prerequisite
For a detailed study of file formats, I am using Databricks. You can try my examples using the Databricks Community Edition, which is absolutely free to use.
1. I have created a notebook in Databricks and started the cluster.
2. We need sample CSV, JSON, Avro, ORC, and Parquet files for the examples.
The code below shows how to read and write files in Databricks.
The code is in PySpark. (PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing. It also provides an optimized API that can read data from various data sources containing different file formats.)
To read a file
df = spark.read.format(filetype).option("header", True).load(filepath)
This is just one of the ways to read a file.
To write a file
df.write.mode('overwrite').format(filetype).option("header",True).save(filepath)
filepath: the location of the file to read from or write to
filetype: we can choose CSV, Parquet, ORC, XML, Excel, Avro, JSON, or Delta in Databricks
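As a concrete sketch, the snippet below reads a CSV file from DBFS and writes it back out as Parquet; the paths are placeholders you would replace with your own, and spark is the SparkSession that a Databricks notebook provides by default:

# Hypothetical example: read a CSV file from DBFS and write it back as Parquet.
csv_path = "/FileStore/tables/employees.csv"        # placeholder path
parquet_path = "/FileStore/tables/employees_parquet" # placeholder path

df = (spark.read.format("csv")
      .option("header", True)        # treat the first line as column names
      .option("inferSchema", True)   # let Spark guess the column types
      .load(csv_path))

(df.write.mode("overwrite")
   .format("parquet")
   .save(parquet_path))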
Now let's see the details of each file type.
CSV
CSV files (comma-separated values) are usually used to exchange tabular data between systems using plain text. A CSV file may contain a header row with column names for the data; otherwise, the file is considered only partially structured. CSV is mainly used for small datasets.
CSV is human-readable and easy to edit manually
CSV provides a simple scheme
CSV can be processed by almost all existing applications
CSV is easy to implement and parse
No support for column types
No difference between text and numeric columns
Poor support for special characters
The data must be flat
It is not efficient and cannot handle nested data
I have loaded the file to DBFS in Databricks (the file location is the file path) and read it in CSV format (the file type).
This file didn't have a header, which is why the first data entry is treated as a heading: when the header option is enabled, the first line is read as the column names. Files may also use separators other than commas, such as tabs or spaces.
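As a sketch of how those options are set, here is how a tab-separated file without a header row could be read; the path is a placeholder used for illustration only:

# Sketch: reading a tab-separated file that has no header row.
df_csv = (spark.read.format("csv")
          .option("header", False)     # first line is data, not column names
          .option("sep", "\t")         # tab-separated instead of comma-separated
          .option("inferSchema", True) # let Spark guess the column types
          .load("/FileStore/tables/sample_data.tsv"))
df_csv.show()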
JSON
JSON (JavaScript Object Notation) data is presented as key-value pairs in a partially structured format. If possible, convert it to a more efficient format before processing large amounts of data. It is great for small datasets.
Human-readable, but it can be difficult to read if there are lots of nested fields.
It can store data in a hierarchical format
The data contained in JSON documents can ultimately be stored in more performance-optimized formats such as Parquet or Avro; JSON files often serve as the raw data.
JSON is not very splittable
JSON lacks indexing
It is less compact compared to other binary formats.
JSON consumes more memory due to repeated column names.
Poor support for special characters
The schema of the JSON file we loaded is displayed above.
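A minimal sketch of how such a schema can be inspected; the path is a placeholder, and the multiLine option is only needed when each record spans multiple lines:

# Sketch: read a (possibly nested) JSON file and inspect its schema.
df_json = (spark.read.format("json")
           .option("multiLine", True)   # records spread over multiple lines
           .load("/FileStore/tables/sample_data.json"))
df_json.printSchema()                   # nested fields appear as struct types
df_json.show(truncate=False)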
Parquet
It is a Columnar Format.
Not human readable
The schema travels with the data
Parquet files are just files, which makes them easy to work with, move, back up, and replicate.
Parquet provides very good compression, up to 75%, even when using lightweight compression codecs like Snappy.
Parquet files are immutable and support schema evolution. Spark knows how to merge the schema if it changes over time (you must specify a special option while reading), but you can only change something in an existing file by overwriting it.
Parquet is typically used after preprocessing, for further analytics, because usually not all fields are required at that stage.
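The special read option mentioned above is mergeSchema. Below is a minimal sketch of writing a small DataFrame as Parquet and reading it back with schema merging enabled; the path and sample data are placeholders:

# Sketch: write a small DataFrame as Parquet and read it back.
parquet_path = "/FileStore/tables/employees_parquet"   # placeholder path

sample_df = spark.createDataFrame(
    [("Ammu", "chennai", 40000), ("kochu", "kochi", 50000)],
    ["name", "city", "salary"],
)
sample_df.write.mode("overwrite").format("parquet").save(parquet_path)

df_parquet = (spark.read.format("parquet")
              .option("mergeSchema", True)   # merge schemas across files
              .load(parquet_path))
df_parquet.select("name", "salary").show()   # columnar: scan only the needed columns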
Avro
Avro is a row-based storage format, which is widely used for serialization
Its data is not human-readable;
It is row-based storage
It is a compressed format
Mainly used for write-heavy operations
The Avro format is preferred for loading data into a data lake landing zone, because downstream systems can easily retrieve table schemas from the files, and any source schema changes can be easily handled.
Due to its efficient serialization and deserialization properties, it offers good performance.
The schema is stored in JSON format, while the data is stored in binary format, which minimizes file size and maximizes efficiency.
This allows old software to read new data, and new software to read old data
The schema used to read Avro files does not necessarily have to be the same as the one used to write the files. This allows new fields to be added independently of each other
Avro is usually used to store the raw data because all fields are usually required during ingestion
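A minimal sketch of writing and reading Avro, assuming the Avro data source is available in your runtime (it is bundled with recent Databricks runtimes); the path and sample data are placeholders:

# Sketch: write and read Avro; the schema travels with the file.
avro_path = "/FileStore/tables/employees_avro"   # placeholder path

sample_df = spark.createDataFrame(
    [("Ammu", "chennai", 40000), ("midhu", "banglore", 45000)],
    ["name", "city", "salary"],
)
sample_df.write.mode("overwrite").format("avro").save(avro_path)

df_avro = spark.read.format("avro").load(avro_path)
df_avro.printSchema()   # the schema is stored alongside the binary data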
ORC (Optimized Row Columnar)
Similar to Parquet, it offers better compression. It also provides better schema evolution support, but it is less popular. The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data. This format was designed to overcome the limitations of other file formats. ORC stores collections of rows in one file, and within each collection, the row data is stored in a columnar format.
It is a compressed file format
Not human readable format
It is Column based storage
It is mainly used for write-heavy operations
It has better schema evolution support
It has good splittability support (the file can be divided into several pieces)
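A minimal sketch of writing and reading ORC with an explicit compression codec; the path, sample data, and the zlib choice are assumptions for illustration:

# Sketch: write and read ORC with an explicitly chosen compression codec.
orc_path = "/FileStore/tables/employees_orc"   # placeholder path

sample_df = spark.createDataFrame(
    [("kochu", "kochi", 50000), ("midhu", "banglore", 45000)],
    ["name", "city", "salary"],
)
(sample_df.write.mode("overwrite")
    .format("orc")
    .option("compression", "zlib")   # assumed codec; snappy is another common option
    .save(orc_path))

df_orc = spark.read.format("orc").load(orc_path)
df_orc.show()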
Conclusion
CSV should typically be the fastest to write.
JSON is the easiest for humans to understand.
Parquet is the fastest when reading a subset of columns.
Avro is the fastest when reading all columns at once.
Parquet and Avro are definitely more optimized for the needs of Big Data: splittability, compression support, and excellent support for complex data structures. However, their readability and writing speed are quite poor.