Reading JSON Object and Files with Pandas

JSON, or JavaScript Object Notation, is a popular file format for storing semi-structured data. Transforming it into a table is not always easy, and sometimes it is downright ridiculous. Pandas makes it possible with the read_json function, but if you are not familiar with its orient argument, you might have a hard time.

First, let’s take a look at the read_json() documentation. In my opinion, one of the most valuable arguments that you can pass to this function is the orient argument, because it tells Pandas how your JSON file is structured.

'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
'records' : list like [{column -> value}, ... , {column -> value}]
'index' : dict like {index -> {column -> value}}
'columns' : dict like {column -> {index -> value}}
'values' : just the values array
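One way to get a feel for these layouts is to go the other way: build a small DataFrame and print what to_json produces for each orient value. This is only a sketch; the two-row fruit data and the names df and layouts are made up for illustration.

```python
import pandas as pd

# Small illustrative frame (hypothetical data).
df = pd.DataFrame(
    {"weight": [5, 8], "count": [20, 30]},
    index=["apples", "bananas"],
)

# Serialize the same frame once per orientation to compare the layouts.
layouts = {
    orient: df.to_json(orient=orient)
    for orient in ["split", "records", "index", "columns", "values"]
}

for orient, text in layouts.items():
    print(f"{orient:>8}: {text}")
```

Comparing the printed strings side by side shows which orientations keep the index labels, which keep the column names, and which keep neither.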

When you want to read your JSON file or object into a Pandas object, you’re gonna have to find the right value for the orient argument. To help you understand these five formats, let’s go over all of them.

READ_JSON WITH split

In the following lines of code, I created a JSON string that matches the split orientation. You can see that I specified the three keys this orientation expects: columns, index and data.

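import json
from io import StringIO

import pandas as pd

json_split = json.dumps({
    "columns": ["weight", "count"],
    "index": ["apples", "bananas", "cherries", "pears"],
    "data": [[5, 20], [8, 30], [2, 120], [4, 16]]
})
# Wrap the string in StringIO: passing a raw JSON string is deprecated since pandas 2.1.
pd.read_json(StringIO(json_split), orient='split')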

There’s not much flexibility here. If you use any key other than columns, index or data, you’ll run into the following error.

in check_keys_split
 raise ValueError(f"JSON data had unexpected key(s): {bad_keys}")
ValueError: JSON data had unexpected key(s)
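To see this error for yourself, here is a minimal sketch that swaps the data key for a deliberately wrong one. The key name rows and the variable names are my own picks for illustration, not anything Pandas defines.

```python
import json
from io import StringIO

import pandas as pd

# "rows" is a deliberately wrong key; the split orientation expects "data".
json_bad_keys = json.dumps({
    "columns": ["weight", "count"],
    "index": ["apples", "bananas"],
    "rows": [[5, 20], [8, 30]],
})

try:
    pd.read_json(StringIO(json_bad_keys), orient="split")
    error_message = None
except ValueError as err:
    # Keep the message so it can be inspected or logged.
    error_message = str(err)

print(error_message)
```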

If you add extra values to a particular row, you’ll be greeted with another error:

in _list_to_arrays
 raise ValueError(e) from e
ValueError: X columns passed, passed data had X columns

However, you can mix data types.
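Both behaviours can be checked with a short sketch. The data below is hypothetical: the first string puts text and a number in the same column, the second gives one row more values than there are columns.

```python
import json
from io import StringIO

import pandas as pd

# Mixed types: one "count" value is a string, the other an integer.
json_mixed = json.dumps({
    "columns": ["weight", "count"],
    "index": ["apples", "bananas"],
    "data": [[5, "twenty"], [8, 30]],
})
df_mixed = pd.read_json(StringIO(json_mixed), orient="split")
print(df_mixed)

# Ragged data: the first row has three values for two columns.
json_ragged = json.dumps({
    "columns": ["weight", "count"],
    "index": ["apples", "bananas"],
    "data": [[5, 20, 99], [8, 30]],
})
try:
    pd.read_json(StringIO(json_ragged), orient="split")
    ragged_failed = False
except ValueError:
    ragged_failed = True

print(ragged_failed)
```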

READ_JSON WITH records

When we use the records orientation, we assume that every line holds one row as a set of key-value pairs, with each key matching a column name and each value being that row’s value in that column.

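from io import StringIO

json_record = json.dumps([
    {"fruit": "apples", "weight": 5, "count": 20},
    {"fruit": "bananas", "weight": 8, "count": 30},
    {"fruit": "cherries", "weight": 2, "count": 120},
    {"fruit": "pears", "weight": 4, "count": 16}
])
# Wrap the string in StringIO: passing a raw JSON string is deprecated since pandas 2.1.
pd.read_json(StringIO(json_record), orient='records')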

There’s a lot more flexibility here. Passing more or fewer key-value pairs on a specific row works just fine. When a row lacks a specific key, a missing value (np.nan) is inserted instead.
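A minimal sketch of that flexibility, using made-up records in which one row is missing a key and another carries an extra one (the color key is hypothetical):

```python
import json
from io import StringIO

import pandas as pd

# The second record has no "count" key; the third has an extra "color" key.
json_uneven = json.dumps([
    {"fruit": "apples", "weight": 5, "count": 20},
    {"fruit": "bananas", "weight": 8},
    {"fruit": "cherries", "weight": 2, "count": 120, "color": "red"},
])
df_uneven = pd.read_json(StringIO(json_uneven), orient="records")

# Cells without a value in the JSON show up as missing values.
print(df_uneven)
print(df_uneven.isna().sum())
```

The isna().sum() call counts the missing cells per column, which is a quick way to spot which keys were not present in every record.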

READ_JSON WITH index

Another orientation to suit your semi-structured data needs is index. On the highest level, you specify the row index, and on the next level, the key matches the column name.

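from io import StringIO

json_index = json.dumps({
    "apples": {
        "weight": 5,
        "count": 20
    },
    "bananas": {
        "weight": 8,
        "count": 30
    },
    "cherries": {
        "weight": 2,
        "count": 120
    },
    "pears": {
        "weight": 4,
        "count": 16
    }
})
# Wrap the string in StringIO: passing a raw JSON string is deprecated since pandas 2.1.
pd.read_json(StringIO(json_index), orient='index')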

This orientation is flexible too. It will insert np.nan values in the rows that do not contain a specific key.
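Here is a small sketch of a row that lacks a key under the index orientation. The data and the name json_index_gap are made up for the example.

```python
import json
from io import StringIO

import pandas as pd

# "bananas" has no "count" entry.
json_index_gap = json.dumps({
    "apples": {"weight": 5, "count": 20},
    "bananas": {"weight": 8},
})
df_index_gap = pd.read_json(StringIO(json_index_gap), orient="index")

# The missing cell is filled with a missing value.
print(df_index_gap)
```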

READ_JSON WITH columns

The columns orientation is the pivoted version of the index orientation. On the highest level, you specify the columns, while on the next level, the key matches the row index name.

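from io import StringIO

json_columns = json.dumps({
    "weight": {
        "apples": 5,
        "bananas": 8,
        "cherries": 2,
        "pears": 4
    },
    "count": {
        "apples": 20,
        "bananas": 30,
        "cherries": 120,
        "pears": 16
    }
})
# Wrap the string in StringIO: passing a raw JSON string is deprecated since pandas 2.1.
pd.read_json(StringIO(json_columns), orient='columns')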

It’s also very flexible, because you can specify indices in one column, without specifying them in the other; np.nan will be inserted.
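The relationship between the columns and index orientations can be sketched by serializing a frame and its transpose. The frame below is hypothetical; the point is only that the two JSON strings come out identical.

```python
import pandas as pd

# Small illustrative frame (hypothetical data).
df = pd.DataFrame(
    {"weight": [5, 8], "count": [20, 30]},
    index=["apples", "bananas"],
)

# The columns layout of a frame...
as_columns = df.to_json(orient="columns")
# ...matches the index layout of its transpose.
as_index_of_transpose = df.T.to_json(orient="index")

print(as_columns)
print(as_columns == as_index_of_transpose)
```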

READ_JSON WITH values

Finally, when your JSON file or object does not contain any column or index names, go for the values orientation. Just like with the records, index and columns orientations, the number of values per row can differ.

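from io import StringIO

json_values = json.dumps([
    ["apples", 5, 20],
    ["bananas", 8, 30],
    ["cherries", 2, 120],
    ["pears", 4, 16]
])
# Wrap the string in StringIO: passing a raw JSON string is deprecated since pandas 2.1.
pd.read_json(StringIO(json_values), orient='values')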
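Because the values orientation carries no names, the resulting frame gets default integer labels. A minimal sketch of naming the columns afterwards follows; the names fruit, weight and count are my own assumption about what the positions mean.

```python
import json
from io import StringIO

import pandas as pd

json_values = json.dumps([
    ["apples", 5, 20],
    ["bananas", 8, 30],
])
df_values = pd.read_json(StringIO(json_values), orient="values")

# Without names in the JSON, the columns are numbered by position.
default_labels = list(df_values.columns)
print(default_labels)

# Assign names by hand (assumed meaning of each position), then index by fruit.
df_values.columns = ["fruit", "weight", "count"]
df_values = df_values.set_index("fruit")
print(df_values)
```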

There’s also the table orientation. In my opinion, it’s unlikely you’ll find this “in the wild”. It seems like an efficient way to store sparse data frames, however.
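If you are curious what the table orientation looks like, a quick sketch is to write a frame with to_json and read it back. The frame and the index name fruit are hypothetical.

```python
import json
from io import StringIO

import pandas as pd

# Small illustrative frame with a named index (hypothetical data).
df = pd.DataFrame(
    {"weight": [5, 8], "count": [20, 30]},
    index=pd.Index(["apples", "bananas"], name="fruit"),
)

# The table layout stores a schema next to the data.
json_table = df.to_json(orient="table")
top_level_keys = sorted(json.loads(json_table))
print(top_level_keys)

# Reading it back restores the named index from the schema.
restored = pd.read_json(StringIO(json_table), orient="table")
print(restored)
```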

By the way, I didn’t necessarily come up with this solution myself. Although I’m grateful you’ve visited this blog post, you should know I get a lot from websites like StackOverflow and I have a lot of coding books. This one by Matt Harrison (on Pandas 1.x!) has been updated in 2020 and is an absolute primer on Pandas basics. If you want something broad, ranging from data wrangling to machine learning, try “Mastering Pandas” by Stefanie Molin.

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Happy coding!

1 thought on “Reading JSON Object and Files with Pandas”

Happiman July 24, 2022 at 5:16 am

When we use the split orientation, we assume that on every line, we specified a key and a value,
-> When we use the “records” orientation, we assume that on every line, we specified a key and a value,
