When working on projects, I use the pandas library to process and move my data around. It works great on moderate-size datasets. However, when the number of observations in a dataset is high, saving and loading data becomes slower, and now every kernel restart steals your time and forces you to wait until the data reloads. So eventually, CSV files and other plain-text formats lose their attractiveness.
There are plenty of binary formats for storing data on disk, and pandas supports many of them. Options include Feather, Pickle, HDF5 and Parquet, alongside tools such as Dask and Datatable. Here we will learn how to use Feather to write and read datasets.
What is Feather:
Feather is a fast, lightweight, language-agnostic and easy-to-use binary file format for storing data frames. It uses the Apache Arrow columnar memory specification to represent binary data on disk.
Lightweight, minimal API that makes pushing data frames in and out of memory as simple as possible.
Language agnostic: Feather files are the same whether they are written from Python or R, and they can be written from other languages too.
High read and write performance.
Feather is not designed for long-term data storage. At the time of writing, the file format is not guaranteed to be stable between versions.
Installation is simple. For Python, use either pip or conda:
pip install feather-format
conda install -c conda-forge feather-format
DataFrame.to_feather(path, **kwargs): Write a DataFrame to the binary Feather format.
path: str or file-like object
If a string, it will be used as the root directory path.
**kwargs: Additional keywords passed to pyarrow.feather.write_feather().
Starting with pyarrow 0.17, this includes the compression, compression_level, chunksize and version keywords.
pandas.read_feather(path, ...): Load a feather-format object from the file path.
path: str, path object or file-like object
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3 and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.feather.
If you want to pass in a path object, pandas accepts any os.PathLike.
columns: sequence, default None
If not provided, all columns are read.
use_threads: bool, default True
Whether to parallelize reading using multiple threads.
storage_options: dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc.
Returns: type of object stored in file