Different types of data formats: CSV, Parquet, and Feather

Som
Jun 14, 2022

When we do data analysis or build predictive models with machine learning, we come across various kinds of data formats.

In this post we are going to discuss:

  • CSV format
  • Parquet format
  • Feather format

CSV format (.csv)

CSV is the standard format for most tabular competitions. CSV stands for comma-separated values: each row is a line of text, and the values in a row are separated by commas. It’s the most common format for storing tabular datasets.

But the CSV format has some disadvantages. It works perfectly fine when the data is small (< 3 GB), but as the size grows, CSV files stop being an effective way to store and manipulate data. CSV takes longer to read, and when the data is large (≥ 15 GB), reading a CSV file with pandas can use up all the RAM, because pandas loads the whole file into memory by default. So it is not a very effective choice if you want to store large files.

# read a CSV file into a DataFrame
import pandas as pd
df = pd.read_csv("path to csv file")
# write the DataFrame to a CSV file
# index=False leaves the row index out of the file
df.to_csv("file_save_location.csv", index=False)
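If a CSV file is too large to load in one go, one option is to read it in pieces with the `chunksize` parameter of `pd.read_csv`. The following is only a minimal sketch: the tiny demo file, the column names `id` and `value`, and the chunk size of 4 are made up for illustration, and a real file would need a much larger chunk size.

```python
import os
import tempfile

import pandas as pd

# Build a small demo CSV in a temporary directory (a stand-in for a large file).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "demo.csv")
pd.DataFrame({"id": range(10), "value": range(0, 100, 10)}).to_csv(csv_path, index=False)

# With chunksize set, read_csv yields DataFrames of at most that many rows
# instead of loading the whole file into memory at once.
total = 0
n_rows = 0
for chunk in pd.read_csv(csv_path, chunksize=4):
    total += int(chunk["value"].sum())
    n_rows += len(chunk)

print(n_rows, total)
```

Only one chunk is held in memory at a time, so this works for aggregations that can be built up piece by piece, such as sums and counts.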

Parquet format (.parquet)

Parquet is a lightweight columnar format for saving data frames. It uses efficient compression and encoding schemes for fast data storage and retrieval. Parquet with “gzip” compression (for storage): exporting is slightly faster than plain .csv (and much faster if the CSV also needs to be zipped). Importing is about 2x faster than CSV. The compressed file is around 22% of the original file size, which is about the same as a zipped CSV file.

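# read a Parquet file (pandas needs pyarrow or fastparquet installed)
import pandas as pd
df = pd.read_parquet("parquet_file_path")
# write to Parquet; pandas defaults to snappy compression, so ask for gzip explicitly
df.to_parquet("file_path_tostore.parquet", compression="gzip")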

Feather format (.ftr)


Feather is more efficient than Parquet in terms of data retrieval. Though it occupies comparatively more space than Parquet, storing data in this format ensures fast reads.

Feather with “zstd” compression (for I/O speed): compared to CSV, exporting is about 20x faster and importing is about 6x faster. The stored file is around 32% of the original file size, which is about 10 percentage points larger than gzip Parquet or zipped CSV, but still decent.

# read a Feather file (pandas needs pyarrow installed)
import pandas as pd
df = pd.read_feather("FILE_PATH_TO_FTR_FILE")
# write to Feather with zstd compression; the path is a placeholder
df.to_feather("file_path_tostore.ftr", compression="zstd")
