0

I have JSON data that contains three different types of records that I need to read, parse into each type, and store as three separate data frames. I used dummy data here since the actual records are quite long. The data have different numbers of columns for each record type with some column names in common. The schema are basically:

schema 1: one record per item ID with basic characteristics in wide format that looks like:

[id,year,object_name,base_price]

schema 2 and 3: each are a set of different features in long format for a single id that look like:

[id, feature1, price1]

[id, feature2, price2]

[id, feature1, size1]

[id, feature2, size2]

Here's what the JSON data looks like

[

{"id": ["500068"], "year": ["2018"], "object_name": "barn"},

{"id": ["500069"], "year": ["2018"], "object_name": "well"},

{"id": ["500068"], "year": ["2018"], "feature": "extension","price": "65"},

{"id": ["500068"], "year": ["2018"], "feature": "stalls","price": "72"},

{"id": ["500070"], "year": ["2018"], "feature": "depth_extension","price": "55"},

{"id": ["500070"], "year": ["2018"], "feature": "cover","price": "55"},

{"id": ["500070"], "year": ["2018"], "feature": "depth_extension","size": "1050"},

{"id": ["500070"], "year": ["2018"], "feature": "cover","size": "1160"},

]

I tried getting the JSON into a format that i could try to do this in, but I get stuck here.

import pandas as pd

with open(file, 'r') as f:
    data = f.readlines()

# remove the trailing "\n" from each line
data = map(lambda x: x.rstrip(), data)

# each element of 'data' is an individual JSON object.
# i want to convert it into an *array* of JSON objects
# which, in and of itself, is one large JSON object
# basically... add square brackets to the beginning
# and end, and have all the individual business JSON objects
# separated by a comma
data_json_str = "[" + ','.join(data) + "]"

# now, load it into pandas 
# (I get stuck here since they are different record types
data_df = pd.read_json(data_json_str)

I would like the data to be read into three different pandas data frames with the appropriate column names. Thanks for the help!

1 Answer 1

0

To create 3 different data frames from various type of column names of json data, you can try the following code.

import pandas as pd

json_data='[{"id": ["500068"], "year": ["2018"], "object_name": "barn"},{"id": ["500069"], "year": ["2018"], "object_name": "well"},{"id": ["500068"], "year": ["2018"], "feature": "extension","price": "65"},{"id": ["500068"], "year": ["2018"], "feature": "stalls","price": "72"},{"id": ["500070"], "year": ["2018"], "feature": "depth_extension","price": "55"},{"id": ["500070"], "year": ["2018"], "feature": "cover","price": "55"},{"id": ["500070"], "year": ["2018"], "feature": "depth_extension","size": "1050"},{"id": ["500070"], "year": ["2018"], "feature": "cover","size": "1160"},]'
list_data=eval(json_data)

df=pd.DataFrame(list_data)

df1=df[df['size'].notnull()]
df1=df1.drop(['object_name','price'], 1)
print(df1)
print('\n')

df2=df[df['price'].notnull()]
df2=df2.drop(['object_name','size'], 1)
print(df2)
print('\n')

df3=df[df['object_name'].notnull()]
df3=df3.drop(['feature', 'price', 'size'], 1)
print(df3)
print('\n')

The output:

           feature        id  size    year
6  depth_extension  [500070]  1050  [2018]
7            cover  [500070]  1160  [2018]


           feature        id price    year
2        extension  [500068]    65  [2018]
3           stalls  [500068]    72  [2018]
4  depth_extension  [500070]    55  [2018]
5            cover  [500070]    55  [2018]


         id object_name    year
0  [500068]        barn  [2018]
1  [500069]        well  [2018]

Hope this can help you, have a nice day. :)

1
  • Ah, easy and elegant answer! Thank you!
    – mth10
    May 18, 2019 at 19:16

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service, privacy policy and cookie policy

Not the answer you're looking for? Browse other questions tagged or ask your own question.