
BIG DATA FILE FORMATS: A DETAILED STUDY USING DATABRICKS





The right data format is essential to achieving optimal performance and desired business outcomes. Analysts, data scientists, engineers, and business users need to know these formats in order to make decisions and understand workflows. Choosing the correct file type is very important: the right format gives faster reads, faster writes, better file splittability, support for schema evolution, and advanced compression.

Before moving to file formats, we first need to understand:

What is row-wise and column-wise storage of data?

Once data is stored in a file, a computer can read it in one of two ways: row-wise or column-wise.

Let's look at the data below, laid out row-wise:


Ammu,chennai,40000,kochu,kochi,50000,midhu,banglore,45000


To process this data, a computer reads it from left to right, starting at the first row and then reading each subsequent row. Storing data in this format is ideal when you need to access one or more entire records, with all or most of their columns. Column-based formats, in contrast, store the same data by column:


Ammu,kochu,midhu,chennai,kochi,banglore,40000,50000,45000


In columnar formats, data is stored sequentially by column, from top to bottom, rather than by row from left to right. Having data grouped by column makes it efficient to focus computation on specific columns of data. Storing the data sequentially by column also allows for a faster scan, because all relevant values are stored next to each other and there is no need to search for them within each row.

At the highest level, column-based storage is most useful when performing analytics queries that require only a subset of columns examined over very large data sets. If your queries require access to all or most of the columns of each row of data, row-based storage will be better suited to your needs.
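To make the difference concrete, here is a minimal Python sketch using the three sample records above; it shows the two layouts and why summing one column only touches a contiguous block in the columnar layout:

# The three sample records from above
rows = [("Ammu", "chennai", 40000), ("kochu", "kochi", 50000), ("midhu", "banglore", 45000)]

# Row-wise layout: all values of one record sit next to each other
row_store = [value for record in rows for value in record]
# ['Ammu', 'chennai', 40000, 'kochu', 'kochi', 50000, 'midhu', 'banglore', 45000]

# Column-wise layout: all values of one column sit next to each other
column_store = [record[i] for i in range(3) for record in rows]
# ['Ammu', 'kochu', 'midhu', 'chennai', 'kochi', 'banglore', 40000, 50000, 45000]

# Summing the salary column only needs the last contiguous block of the columnar layout
total_salary = sum(column_store[6:9])  # 135000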

Below are the main file formats used in Big Data:

CSV

JSON

Parquet

Avro

ORC


Prerequisites

For this detailed study of file formats, I am using Databricks. You can try my examples using the Databricks Community Edition, which is absolutely free to use.

1. I have created a notebook in Databricks and started the cluster.

2. We need sample CSV, JSON, Avro, ORC, and Parquet files for the examples.


The code below is for reading and writing files in Databricks.

The code is written in PySpark (PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing. It provides an optimized API that can read data from various data sources containing different file formats).


  • To read a file

df = spark.read.format(filetype).option("header", True).load(filepath)

This is just one of the ways to read a file.

  • To write a file

df.write.mode("overwrite").format(filetype).option("header", True).save(filepath)


filepath: the location from which the file is read or to which it is written

filetype: in Databricks we can choose CSV, Parquet, ORC, XML, Excel, Avro, JSON, or Delta
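Putting the two lines together, here is a minimal sketch; the DBFS path /FileStore/tables/employees.csv is just a placeholder for wherever you uploaded your sample file:

# `spark` is the SparkSession a Databricks notebook provides automatically;
# outside Databricks you can create one yourself.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

filepath = "/FileStore/tables/employees.csv"   # hypothetical DBFS location
filetype = "csv"

# Read the file into a DataFrame, treating the first line as column names
df = spark.read.format(filetype).option("header", True).load(filepath)

# Write the same data back out as Parquet, overwriting any previous output
df.write.mode("overwrite").format("parquet").save("/FileStore/tables/employees_parquet")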

Now let's see the details of each file type.


CSV


CSV (comma-separated values) files are usually used to exchange tabular data between systems as plain text. A CSV file normally contains a header row with the column names for the data; otherwise, the file is considered only partially structured. CSV is mainly used for small datasets.

  • CSV is human-readable and easy to edit manually

  • CSV provides a simple schema

  • CSV can be processed by almost all existing applications

  • CSV is easy to implement and parse

  • No support for column types

  • No difference between text and numeric columns

  • Poor support for special characters

  • The data must be flat

  • It is not efficient and cannot handle nested data





I have loaded the file to DBFS in Databricks (the file location becomes the file path) and read it in CSV format (the file type).

This file didn't have a header, which is why the first data entry was treated as the heading; when the header option is enabled, the first line of the file is used as the column names. Files may also use separators other than commas, such as tabs or spaces. A sketch handling both cases is shown below.
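Here is a minimal sketch of the two situations; the DBFS paths and column names are placeholders for your own uploads:

# `spark` is the SparkSession provided by the Databricks notebook
# File with no header row: supply the column names ourselves
df_no_header = (spark.read.format("csv")
                .option("header", False)
                .option("inferSchema", True)
                .load("/FileStore/tables/employees_no_header.csv")
                .toDF("name", "city", "salary"))

# Tab-separated file: override the default comma separator
df_tsv = (spark.read.format("csv")
          .option("header", True)
          .option("sep", "\t")
          .load("/FileStore/tables/employees.tsv"))

df_no_header.show()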


JSON



JSON (JavaScript Object Notation) data is presented as key-value pairs in a partially structured format. If possible, convert it to a more efficient format before processing large amounts of data. It is great for small datasets.

  • Human-readable but it can be difficult to read if there are lots of nested fields.

  • It can store data in a hierarchical format

  • The data contained in JSON documents can ultimately be stored in more performance-optimized formats such as Parquet or Avro; the JSON files then serve as the raw data

  • JSON is not very splittable

  • JSON lacks indexing

  • It is less compact compared to binary formats

  • JSON consumes more memory due to repeated column names

  • Poor support for special characters

The screenshot above shows the schema of the JSON file we loaded.
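As a minimal sketch (the path and the multiLine flag are assumptions about how your sample file is laid out), reading JSON and inspecting its inferred schema looks like this:

# `spark` is the SparkSession provided by the Databricks notebook
df_json = (spark.read.format("json")
           .option("multiLine", True)   # needed when each record spans several lines
           .load("/FileStore/tables/employees.json"))

# Spark infers the (possibly nested) schema from the key-value pairs
df_json.printSchema()
df_json.show(truncate=False)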




Parquet


  • It is a Columnar Format.

  • Not human readable

  • The schema travels with the data

  • Parquet files are just files, which means it is easy to work with, move, back up, and replicate them

  • Parquet provides very good compression, up to 75%, even when using lightweight compression codecs like Snappy

  • Parquet files are immutable but support schema evolution. Spark knows how to merge the schema if it changes over time (you must specify the mergeSchema option while reading; see the sketch after this list), but you can only change something in an existing file by overwriting it.

  • Parquet is typically used after preprocessing, for further analytics, because usually not all fields are required at that stage
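A minimal sketch of the points above (the paths are placeholders, and df is the DataFrame read earlier): writing Parquet with Snappy compression, reading back only a subset of columns, and merging schemas that changed over time.

# `spark` is the SparkSession provided by the Databricks notebook
# Write with Snappy compression (Snappy is also Spark's default codec for Parquet)
df.write.mode("overwrite").option("compression", "snappy").parquet("/FileStore/tables/emp_parquet")

# Columnar format: reading only the columns you need avoids scanning the rest
subset = spark.read.parquet("/FileStore/tables/emp_parquet").select("name", "salary")

# If files written at different times have different schemas, ask Spark to merge them
merged = spark.read.option("mergeSchema", True).parquet("/FileStore/tables/emp_parquet")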


Avro


Avro is a row-based storage format that is widely used for serialization. A short read/write sketch follows the list below.


  • Its data is not human-readable;

  • It is row-based storage

  • It is a compressed format

  • Mainly used for write-heavy workloads

  • The Avro format is preferred for loading the data lake landing zone, because downstream systems can easily retrieve table schemas from the files, and any source schema changes can be handled easily

  • Due to its efficient serialization and deserialization property, it offers good performance.

  • The schema is stored in JSON format, while the data is stored in binary format which minimizes file size and maximizes efficiency

  • This allows old software to read new data, and new software to read old data

  • The schema used to read Avro files does not necessarily have to be the same as the one used to write the files. This allows new fields to be added independently of each other

  • Avro is usually used to store the raw data because all fields are usually required during ingestion
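A minimal sketch of reading and writing Avro; the paths are placeholders, Avro support is bundled in the Databricks Runtime, and open-source Spark needs the spark-avro package on the classpath:

# `spark` is the SparkSession provided by the Databricks notebook
# Write the DataFrame read earlier as Avro; the schema is embedded in the files as JSON
df.write.mode("overwrite").format("avro").save("/FileStore/tables/emp_avro")

# Read it back; the row-based layout makes reading all columns at once efficient
df_avro = spark.read.format("avro").load("/FileStore/tables/emp_avro")
df_avro.show()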


ORC (Optimized Row Columnar)



Similar to Parquet, ORC offers better compression, and it provides better schema evolution support as well, but it is less popular. The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data; it was designed to overcome the limitations of other file formats. ORC stores collections of rows in one file, and within each collection the row data is stored in a columnar format. A short read/write sketch follows the list below.

  • It is a compressed file format

  • It is not a human-readable format

  • It is column-based storage

  • It is optimized for read-heavy analytical workloads

  • It has better schema evolution support

  • It has good splittability support (the file can be divided into several pieces)
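A minimal sketch of reading and writing ORC; the paths are placeholders and df is the DataFrame read earlier:

# `spark` is the SparkSession provided by the Databricks notebook
# Write as ORC with ZLIB compression (other supported codecs include snappy and none)
df.write.mode("overwrite").option("compression", "zlib").orc("/FileStore/tables/emp_orc")

# Columnar layout: select only the columns the query needs
df_orc = spark.read.orc("/FileStore/tables/emp_orc").select("city", "salary")
df_orc.show()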



Conclusion

  • CSV should typically be the fastest to write,

  • JSON the easiest to understand for humans,

  • Parquet the fastest to read a subset of columns

  • Avro is the fastest to read all columns at once.

Parquet and Avro are definitely more optimized for the needs of Big Data: splittability, compression support, and excellent support for complex data structures. Their human readability and write speed, however, are comparatively poor.
