When you submit a job to AI Platform and need to read a CSV file stored in Google Cloud Storage, your first reflex is probably to use pandas’ read_csv. However, this will produce the following error:
ImportError: The gcsfs library is required to handle GCS files
That’s because pandas cannot read from Cloud Storage on its own; as the error says, it relies on the gcsfs library for that. However, you can use file_io from tensorflow to access the data and then read it in using pandas.
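from io import StringIO  # standard library; pandas.compat.StringIO is gone in pandas 0.25+

import pandas as pd
from tensorflow.python.lib.io import file_io


# read the input data
def read_data_from_gcs(path):
    # file_io can open gs:// paths; pandas then parses the text from memory
    with file_io.FileIO(path, mode='r') as file_stream:
        data = pd.read_csv(StringIO(file_stream.read()))
    return data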
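Note that pandas.compat.StringIO is only available in pandas versions below 0.25; importing it fails in later versions. If your code imports StringIO from pandas.compat, pin pandas as shown here (importing StringIO from Python’s built-in io module works with any pandas version):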
pip install "pandas<0.25.0"
Remember, this will not work locally and should only be used when submitting your job to AI Platform.
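If you want the same script to run both locally and on AI Platform, one option is to pick the file opener based on the path. The following is a minimal sketch, not a definitive implementation: open_text and read_csv_anywhere are hypothetical names, and the sketch assumes that anything starting with gs:// should go through file_io while everything else is a local file.

```python
from io import StringIO

import pandas as pd


def open_text(path):
    """Open `path` for reading text (hypothetical helper)."""
    if path.startswith("gs://"):
        # Imported lazily so the local branch works without TensorFlow installed.
        from tensorflow.python.lib.io import file_io
        return file_io.FileIO(path, mode="r")
    # Assumption: any non-gs:// path is a regular local file.
    return open(path, mode="r")


def read_csv_anywhere(path):
    """Read a CSV into a DataFrame from a local path or a gs:// URI."""
    with open_text(path) as stream:
        return pd.read_csv(StringIO(stream.read()))
```

Calling read_csv_anywhere with a local file name uses Python’s built-in open, so TensorFlow is only imported when a gs:// path is actually passed.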
Also, if you want to write from AI Platform to Google Cloud Storage, this should do the trick:
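import os

from tensorflow.python.lib.io import file_io


def copy_data(job_dir, file_path):
    # copy the file at file_path into job_dir (e.g. a gs:// location), byte for byte
    with file_io.FileIO(file_path, mode='rb') as input_f:
        with file_io.FileIO(os.path.join(job_dir, file_path), mode='wb') as output_f:
            output_f.write(input_f.read())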
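If the data you want to write is a DataFrame rather than an existing file, a similar approach should work without an intermediate local copy. This is only a sketch under assumptions: write_csv_anywhere is a hypothetical name, and the gs:// prefix check is an illustrative way to decide between file_io and the built-in open.

```python
import pandas as pd


def write_csv_anywhere(df, path):
    """Write a DataFrame as CSV to a local path or a gs:// URI (hypothetical helper)."""
    # With no target given, to_csv returns the CSV content as a string.
    csv_text = df.to_csv(index=False)
    if path.startswith("gs://"):
        # Imported lazily so local runs do not need TensorFlow.
        from tensorflow.python.lib.io import file_io
        opener = file_io.FileIO
    else:
        opener = open
    with opener(path, mode="w") as output_f:
        output_f.write(csv_text)
```

On AI Platform you would pass something like os.path.join(job_dir, "output.csv") as the path; locally, a plain file name goes through the built-in open.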