Home » Reading JSON Object and Files with Pandas

Reading JSON Object and Files with Pandas

  • by
  • 4 min read

JSON or JavaScript Object Notation is a popular file format for storing semi-structured data. Transforming it to a table is not always easy and sometimes downright ridiculous. However, Pandas offers the possibility via the read_json function. If you are not familiar with the orient argument, you might have a hard time.

First, let’s take a look at the read_json() documentation. In my opinion, one of the most valuable arguments that you can pass to this function is the orient argument, because it give an indication of how your JSON file is structured.

'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
'records' : list like [{column -> value}, ... , {column -> value}]
'index' : dict like {index -> {column -> value}}
'columns' : dict like {column -> {index -> value}}
'values' : just the values array

When you want to read your JSON file or object into a Pandas object, you’re gonna have to find the right value for the orient argument. To help you understand these five formats, let’s go over all of them.

Read_json with split

In the following lines of code I created a JSON string that matches the split orientation. You can see that I specified three required keys: columns, index and data.

json_split = json.dumps({
    "columns": ["weight","count"],
    "index": ["apples","bananas","cherries","pears"],
    "data": [[5,20],[8,30],[2,120],[4,16]]
})
pd.read_json(json_split, orient = 'split')

There’s not much flexibility here. If you change the keys (other than columns, index or data), you’ll run into the following error.

in check_keys_split
 raise ValueError(f"JSON data had unexpected key(s): {bad_keys}")
ValueError: JSON data had unexpected key(s)

If you add extra values to a particular row, you’ll be greeted with another error:

in _list_to_arrays
 raise ValueError(e) from e
ValueError: X columns passed, passed data had X columns

However, you can mix data types.

FOLLOW THE GRASP ON LINKEDIN

Curated pieces about data, analytics, machine learning, and artificial intelligence. No bullshit, no high-level promotional stuff trying to sell you a data strategy, no robots coming for your jobs. Just news, opinions and blogs from experts in tech and academics.

You won’t regret it. Trust me.

READ_JSON WITH records

When we use the split orientation, we assume that on every line, we specified a key and a value, with a key matching the column and — for each row — its value matching the values in that column.

json_record = json.dumps([
    {"fruit":"apples","weight":5,"count":20},
    {"fruit":"bananas","weight":8,"count":30},
    {"fruit":"cherries","weight":2,"count":120},
    {"fruit":"pears","weight":4,"count":16}
])
pd.read_json(json_record, orient = 'records')

There’s a lot more flexibility here. Passing more or less key-value pairs on a specific row works just fine. When rows don’t have a specific key, the np.nan value will be inserted instead.

READ_JSON WITH index

Another orientation to suit your semistructured data needs is index. On the highest level, you specify the row index, and on the next level, the key matches the column name.

json_index = json.dumps({
    "apples": {
        "weight": 5, 
        "count": 20
        },
    "bananas": {
        "weight": 8, 
        "count": 30
        },
    "cherries": {
        "weight": 2, 
        "count": 120
        },
    "pears": {
        "weight": 4, 
        "count": 16
        }

})
pd.read_json(json_index, orient = 'index')

As I said, also flexible. It will insert np.nan values in the rows that do not contain a specific key.

READ_JSON WITH Columns

The columns orientation is the pivoted version of the index orientation. On the highest level, you specifiy the columns, while on the next level, the key matches the row index name.

json_columns = json.dumps({
    "weight": {
        "apples": 5,
        "bananas": 8,
        "cherries": 2,
        "pears": 4
    },
    "count": {
        "apples": 20,
        "bananas": 30,
        "cherries": 120,
        "pears": 16
    }
})
pd.read_json(json_columns, orient = 'columns')

It’s also very flexible, because you can specify indices in one column, without specifying them in the other; np.nan will be inserted.

READ_JSON WITH values

Finally, when your JSON file or object does not contain any column or index names, go for the values orientation. Just like the previous orientations, the values per row can differ.

json_values = json.dumps([
    ["apples", 5, 20],
    ["bananas", 8, 30],
    ["cherries", 2,120],
    ["pears", 4, 16]

])
pd.read_json(json_values, orient = 'values')

To conclude, there’s also the table orientation. In my opinion, it’s unlikely you’ll find this “in the wild”. It seems like an efficient way to store sparse data frames, however.

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Leave a Reply

Your email address will not be published. Required fields are marked *