When you try to read a CSV file stored in Google Cloud Storage from a job submitted to AI Platform, your first reflex is probably pandas’ read_csv. However, this produces the following error:
ImportError: The gcsfs library is required to handle GCS files
That’s because pandas cannot open gs:// paths natively — it delegates to the gcsfs library, which isn’t installed by default on AI Platform. You can instead use file_io from TensorFlow to open the file and hand its contents to pandas.
from tensorflow.python.lib.io import file_io
from pandas.compat import StringIO
import pandas as pd

# read the input data from a gs:// path
def read_data_from_gcs(path):
    # file_io understands GCS paths; the with-block closes the handle for us
    with file_io.FileIO(path, mode='r') as file_stream:
        # wrap the file contents in StringIO so pandas can parse them
        return pd.read_csv(StringIO(file_stream.read()))
Make sure to use a pandas version below 0.25: pandas.compat.StringIO was removed in pandas 0.25, so the import above fails on newer versions. (Alternatively, import StringIO from Python’s built-in io module, which works with any pandas version.)
pip install "pandas<0.25.0"
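The StringIO step itself is plain pandas and can be exercised locally without GCS. A minimal sketch — the CSV payload below is made up for illustration, standing in for file_stream.read():

```python
from io import StringIO
import pandas as pd

# Stand-in for the string returned by file_stream.read()
csv_text = "name,score\nalice,10\nbob,7\n"

# Same parsing step as in read_data_from_gcs
df = pd.read_csv(StringIO(csv_text))
print(df.shape)  # (2, 2)
```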
Remember, this will not work locally and should only be used when submitting your job to AI Platform.
Also, if you want to write from AI Platform to Google Cloud Storage, this should do the trick:
import os

def copy_data(job_dir, file_path):
    # read the local file and write it out to the job directory on GCS
    with file_io.FileIO(file_path, mode='rb') as input_f:
        # open in binary mode ('wb+') to match the bytes read above
        with file_io.FileIO(
                os.path.join(job_dir, file_path), mode='wb+') as output_f:
            output_f.write(input_f.read())
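The copy pattern itself (read bytes, write bytes) can be sketched with plain Python file objects — here the built-in open stands in for file_io so the example runs anywhere, and basename is used so an absolute source path still lands inside the destination directory (paths are illustrative):

```python
import os
import tempfile

def copy_data_local(job_dir, file_path):
    # Same read-then-write pattern as copy_data, using built-in open
    with open(file_path, mode='rb') as input_f:
        with open(os.path.join(job_dir, os.path.basename(file_path)),
                  mode='wb') as output_f:
            output_f.write(input_f.read())

# Usage: copy a small file into a temporary "job dir"
src_dir = tempfile.mkdtemp()
job_dir = tempfile.mkdtemp()
src = os.path.join(src_dir, 'data.csv')
with open(src, 'wb') as f:
    f.write(b'a,b\n1,2\n')
copy_data_local(job_dir, src)
```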
Great success!