Prep dataframe to insert to database

Prep dataframe to insert to database

Situation: When inserting to postgresql with sqlalchemy, the to_sql() function works, but we need to make sure we’re appending the id (primary key) column the right way – incrementally.

This will involve manipulating the dataframe with by incremeting with the max_id before using reset_index() to create an additional column using the natural index, then setting the index=False parameter.

# change column name
# id, graph_id, amount_display, from_address, to_address, tx_timestamp, timestamp_display

# use rename function to change Two column names, set inplace=False to preserve original dataframe column name
df2 = df.rename(columns={'id': 'graph_id',
                         'timestamp': 'tx_timestamp'}, inplace=False)


# reorder dataframe column using list of names
# list of names (in same order as stg_subgraph_bank)
list_of_col_names = ['graph_id', 'amount_display', 'from_address',
                     'to_address', 'tx_timestamp', 'timestamp_display']
df2 = df2.filter(list_of_col_names)

This next part is KEY:

df2.index += max_id  # increment with max_id
df2 = df2.reset_index()  # reset index to later increment with max_id

df3 = df2.rename(columns={'index': 'id'}, inplace=False)

# only do this step if you've made sure to duplicate a test table in postgresql, then ensure that the dataframe is in the same shape as the postgresql table
df3.to_sql('stg_subgraph_bank_1', con=db, if_exists='append', index=False)

For more content on data science, R, and Python find me on Twitter.

Previous
Next