
Reading from and writing files to GCP Storage in an AI Platform job


When a job you submit to AI Platform needs to read a CSV file stored in Google Cloud Storage, your first reflex is probably to use pandas’ read_csv. However, if the gcsfs library is not installed, this will produce the following error:

ImportError: The gcsfs library is required to handle GCS files

That’s because pandas cannot read from Cloud Storage natively; it relies on the gcsfs library for gs:// paths. However, you can use file_io from tensorflow to access the data and then read it in using pandas.

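from io import StringIO

import pandas as pd
from tensorflow.python.lib.io import file_io

# Read a CSV file from a gs:// path into a pandas DataFrame
def read_data_from_gcs(path):
    # file_io.FileIO understands gs:// paths; the context manager closes the stream
    with file_io.FileIO(path, mode='r') as file_stream:
        data = pd.read_csv(StringIO(file_stream.read()))
    return data

StringIO is imported from the standard library here because pandas.compat.StringIO was removed in pandas 0.25. If you prefer to keep the pandas.compat import, make sure to pin pandas below 0.25:

pip install "pandas<0.25.0"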

Remember, this is meant for jobs submitted to AI Platform. It will not work locally unless your environment has TensorFlow installed and is authenticated against Google Cloud.
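If you also want to run the same script on your own machine against local files, one option is to pick the reader based on the path prefix. The following is a minimal sketch under that assumption, not part of the original recipe: read_text is a hypothetical helper name, and the gs:// branch imports file_io lazily so that local runs do not need TensorFlow.

```python
import os
import tempfile


def read_text(path):
    """Return the contents of `path` as a string (hypothetical helper).

    gs:// paths go through TensorFlow's file_io; anything else is
    treated as a local file and opened with the built-in open().
    """
    if path.startswith("gs://"):
        # Imported lazily so local runs do not need TensorFlow installed.
        from tensorflow.python.lib.io import file_io
        with file_io.FileIO(path, mode="r") as f:
            return f.read()
    with open(path, mode="r") as f:
        return f.read()


# Local check with a temporary CSV file (no Cloud Storage involved).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\n")
local_text = read_text(tmp.name)
os.remove(tmp.name)
print(local_text)
```

The returned string can then be handed to pandas in the same way as before, for example pd.read_csv(StringIO(read_text(path))).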

Also, if you want to write from AI Platform to Google Cloud Storage, this should do the trick:

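import os
from tensorflow.python.lib.io import file_io

# Copy a local file into the job directory on Cloud Storage
def copy_data(job_dir, file_path):
    with file_io.FileIO(file_path, mode='rb') as input_f:
        # file_path should be relative: os.path.join drops job_dir if it is absolute
        with file_io.FileIO(os.path.join(job_dir, file_path), mode='wb') as output_f:
            output_f.write(input_f.read())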
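One detail worth watching in the function above is how the destination path is built: os.path.join discards job_dir when file_path is absolute, and it uses the local path separator. As a hedged sketch, a small helper can build the destination from the file name only and always join with forward slashes. The name gcs_destination and the bucket in the example are hypothetical and used for illustration only.

```python
import os
import posixpath


def gcs_destination(job_dir, file_path):
    """Build the destination path for a file copied into job_dir.

    Only the file name of `file_path` is kept, so an absolute local
    path cannot override job_dir, and posixpath.join always uses "/",
    which is what gs:// paths expect.
    """
    return posixpath.join(job_dir, os.path.basename(file_path))


# Hypothetical bucket and file names, for illustration only.
destination = gcs_destination("gs://my-bucket/my-job", "output/model.csv")
print(destination)
```

Keeping only the file name flattens any local subdirectories, so this fits best when the copied files have unique names.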

Great success!
