
Reading from and writing files to GCP Storage in an AI Platform job

When you try to access a CSV file stored in Google Cloud Storage from a job submitted to AI Platform, your first reflex is probably to point pandas’ read_csv straight at the gs:// path.
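A minimal sketch of that first attempt, with placeholder bucket and file names:

import pandas as pd

# reading a gs:// path directly only works if the gcsfs library is installed
data = pd.read_csv('gs://my-bucket/my-data.csv')

On a standard AI Platform runtime, however, this will produce the following error: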

ImportError: The gcsfs library is required to handle GCS files

That’s because pandas can’t read from Cloud Storage natively: without gcsfs installed, it has no way to open gs:// paths. However, you can use file_io from TensorFlow to access the data and then read it in with pandas.

from tensorflow.python.lib.io import file_io
from pandas.compat import StringIO
import pandas as pd

# read the input data from GCS into a DataFrame
def read_data_from_gcs(path):
    with file_io.FileIO(path, mode='r') as file_stream:
        data = pd.read_csv(StringIO(file_stream.read()))
    return data
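Calling it is then a one-liner (the path below is a placeholder):

data = read_data_from_gcs('gs://my-bucket/my-data.csv')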

Make sure to use a pandas version below 0.25: pandas.compat.StringIO was removed in pandas 0.25, so the import above breaks on newer versions. (If you’re stuck on a newer pandas, the standard library’s io.StringIO works as a drop-in replacement.)

pip install "pandas<0.25.0"

Remember, this will not work locally and should only be used when submitting your job to AI Platform.

Also, if you want to write from AI Platform to Google Cloud Storage, this should do the trick:

import os
from tensorflow.python.lib.io import file_io

# copy a local file into the GCS job directory
def copy_data(job_dir, file_path):
    with file_io.FileIO(file_path, mode='rb') as input_f:
        with file_io.FileIO(
                os.path.join(job_dir, file_path), mode='wb') as output_f:
            output_f.write(input_f.read())
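For example, after training you could copy an exported model into the job directory that AI Platform passes in via --job-dir (the names below are placeholders):

copy_data('gs://my-bucket/my-job-dir', 'model.joblib')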

Great success!
