The Very Basic Data Parsing in Python

There are a couple of things that are basic, frequent, and useful in Python data wrangling: creating a list, a dataframe, and a dictionary; identifying a position by index value; and slicing and dicing the target element out of a list, a dataframe, or a dictionary.

First, create a list, a dataframe, and a dictionary

import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)  # six consecutive daily timestamps
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
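The snippets above build dataframes only, so here is a minimal sketch of the plain list and dictionary as well, and of how they feed into pandas. The names prices, tickers, df_small and s, and the values in them, are made up for illustration.

```python
import pandas as pd

# A list: ordered, mutable, addressed by integer position.
prices = [10.5, 11.0, 9.8]

# A dictionary: maps keys to values.
tickers = {"AAA": 10.5, "BBB": 11.0, "CCC": 9.8}

# A DataFrame built from a dictionary of equal-length lists.
df_small = pd.DataFrame({"ticker": list(tickers.keys()),
                         "price": list(tickers.values())})

# A Series built directly from the dictionary; the keys become the index.
s = pd.Series(tickers)

print(prices[0])        # first element of the list
print(tickers["BBB"])   # value stored under the key "BBB"
print(df_small.shape)   # (rows, columns)
```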

The example below shows not only lists and dataframes but also the merge function.
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
pd.merge(left, right, on='key')
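Because the key 'foo' appears twice on each side, this merge pairs every left row with every right row. A short sketch of what comes back, using the same two frames (the variable name merged is just for illustration):

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

# Default is an inner join on the shared column; duplicate keys on both
# sides give a many-to-many result: 2 x 2 = 4 rows.
merged = pd.merge(left, right, on="key")

print(merged)
print(len(merged))  # 4
```

If duplicated keys are not expected, it is worth checking the row count after a merge, since the result can be larger than either input.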

Second, identify the position by index value

a = pd.Index(['c', 'b', 'a'])
b = pd.Index(['c', 'e', 'd'])
a.union(b)  # older code wrote a | b, which no longer performs a set union in recent pandas
Out[294]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
a.intersection(b)  # older code wrote a & b
Out[295]: Index(['c'], dtype='object')
a.difference(b)
Out[296]: Index(['a', 'b'], dtype='object')

We can reset the index with reset_index(), and also enforce a new index by assignment: benchmark.index = dates[1:]

To drop the element at a given index position: games.drop(games.index[516], inplace=True)

We can even group by the index directly: series.groupby(series.index.hour).mean()

Note that an index can be supplied as a list, for instance [2001, 2002, 2003], and the name of the index can be added or altered later.
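The three index operations above can be tried on toy data. This is only a sketch: benchmark and series here are small made-up objects standing in for the real ones, and the column name ret is an assumption.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
benchmark = pd.DataFrame({"ret": [0.1, 0.2, 0.3, 0.4, 0.5]})

# Enforce an index: the new index must have as many entries as the frame has rows.
benchmark.index = dates[1:]

# reset_index() moves the current index into a column and restores 0..n-1.
flat = benchmark.reset_index()

# Drop the row at integer position 2 by looking up its label first.
trimmed = benchmark.drop(benchmark.index[2])

# Group by a property of the index, here the hour of each timestamp.
series = pd.Series(
    [1.0, 3.0, 5.0, 7.0],
    index=pd.to_datetime(["2013-01-01 09:00", "2013-01-01 10:00",
                          "2013-01-02 09:00", "2013-01-02 10:00"]),
)
hourly_mean = series.groupby(series.index.hour).mean()

print(flat.columns.tolist())
print(len(trimmed))
print(hourly_mean)
```

The sketch uses drop without inplace=True so that the original frame stays available for comparison.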

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3.T
pd.DataFrame(pop, index=[2001, 2002, 2003])
frame3.index.name = 'year'; frame3.columns.name = 'state'  # name the index and the columns

It looks like this:

[Image: index pic]

The index name can then be shown in a plot as an axis label: ax.set_ylabel(df.index.name)
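For readers who want the nested-dictionary frame as text, here is a small sketch using the same pop data. The sort_index() call and the name explicit are additions for illustration, so that the row order does not depend on the pandas version.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys become columns, inner keys become the index.
# Years missing from one state are filled with NaN.
frame3 = pd.DataFrame(pop).sort_index()
frame3.index.name = "year"
frame3.columns.name = "state"

# Passing an explicit index keeps only those years, adding empty rows if needed.
explicit = pd.DataFrame(pop, index=[2001, 2002, 2003])

print(frame3)
print(explicit)
```

Nevada has no entry for 2000, so that cell is NaN, and the whole 2003 row of explicit is NaN because neither state has data for that year.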

np.where can be used to find a location, which usually serves as a proxy for the index:

loc = np.where(a == 11)  # here a is a 2-D NumPy array, not the Index defined earlier
output: (array([2], dtype=int64), array([1], dtype=int64))

or for a dataframe: indices = np.where(universe.fs_comp_uid == '000CS1-E')
output: (array([ 217, 1163], dtype=int64),)
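np.where with a single condition returns integer positions, one array per dimension. The sketch below uses a small made-up array and a made-up universe frame (the row labels r0 to r3 are hypothetical) to show how those positions map back to index labels.

```python
import numpy as np
import pandas as pd

# 2-D array: np.where returns one array of row positions and one of column positions.
a = np.array([[1, 2], [3, 4], [5, 11]])
rows, cols = np.where(a == 11)

# Column of a dataframe: a single array of row positions comes back.
universe = pd.DataFrame(
    {"fs_comp_uid": ["X", "000CS1-E", "Y", "000CS1-E"]},
    index=["r0", "r1", "r2", "r3"],
)
(positions,) = np.where((universe["fs_comp_uid"] == "000CS1-E").to_numpy())

# Positions are not labels; translate them through the index when labels are needed.
labels = universe.index[positions]

print(rows, cols)
print(positions)
print(list(labels))
```

With a default 0..n-1 index the positions and the labels coincide, which is why the position is often treated as a proxy for the index.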

Third, slice and dice the target element out of a list, a dataframe, and a dictionary

mask = tmi_1['lsd_shs_float'].isnull()
tmi_1.loc[mask, 'float_share'] = tmi_1['ff_shs_float']  # where lsd_shs_float is missing, take float_share from ff_shs_float
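A runnable sketch of this conditional fill on toy data follows. The numbers are invented, and it assumes float_share starts out as a copy of lsd_shs_float, which the original snippet does not state.

```python
import numpy as np
import pandas as pd

tmi_1 = pd.DataFrame({"lsd_shs_float": [100.0, np.nan, 300.0],
                      "ff_shs_float": [110.0, 220.0, 330.0]})

# Assumption for illustration: float_share begins as the primary source.
tmi_1["float_share"] = tmi_1["lsd_shs_float"]

# Rows where the primary source is missing.
mask = tmi_1["lsd_shs_float"].isnull()

# A single .loc call selects rows and column together, so the assignment
# writes into the frame itself rather than into a temporary copy.
tmi_1.loc[mask, "float_share"] = tmi_1.loc[mask, "ff_shs_float"]

print(tmi_1)
```

Under the same assumption, tmi_1["lsd_shs_float"].fillna(tmi_1["ff_shs_float"]) produces the same column in one step.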

df = pd.DataFrame(
    {'AAA': [4, 5, 6, 7], 'BBB': [10, 20, 30, 40], 'CCC': [100, 50, -30, -50]}); df

# if-then logic (.ix has been removed from pandas; .loc is the label-based replacement)
df.loc[df.AAA >= 5, ['BBB', 'CCC']] = 555; df
df.loc[df.AAA < 5, ['BBB', 'CCC']] = 2000; df
df_mask = pd.DataFrame({'AAA': [True] * 4, 'BBB': [False] * 4, 'CCC': [True, False] * 2})
df.where(df_mask, -1000)  # keep values where the mask is True, replace the rest with -1000

df['logic'] = np.where(df['AAA'] > 5, 'high', 'low'); df  # if-then-else into a new column
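To make the effect of each step concrete, the sketch below replays the same if-then logic and records the values it should produce. The name masked is introduced here only to hold the result of where.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"AAA": [4, 5, 6, 7],
                   "BBB": [10, 20, 30, 40],
                   "CCC": [100, 50, -30, -50]})

# If AAA >= 5, set BBB and CCC to 555; otherwise set them to 2000.
df.loc[df["AAA"] >= 5, ["BBB", "CCC"]] = 555
df.loc[df["AAA"] < 5, ["BBB", "CCC"]] = 2000
# BBB and CCC are now both [2000, 555, 555, 555]

# where() keeps a cell when the mask is True and replaces it otherwise.
df_mask = pd.DataFrame({"AAA": [True] * 4,
                        "BBB": [False] * 4,
                        "CCC": [True, False] * 2})
masked = df.where(df_mask, -1000)
# masked["CCC"] is [2000, -1000, 555, -1000]

# np.where with three arguments is an element-wise if-then-else.
df["logic"] = np.where(df["AAA"] > 5, "high", "low")

print(masked)
print(df)
```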

[Image: where]

df.iloc[:, 1:2]  # positional slice: all rows, second column only (the end position is excluded)
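Slicing works in a similar way across the three containers. The sketch below is a minimal illustration with made-up values; the names values, record, subset and so on are hypothetical.

```python
import pandas as pd

# List: slice by position; the end position is excluded.
values = [10, 20, 30, 40, 50]
middle = values[1:3]      # [20, 30]
last_two = values[-2:]    # [40, 50]

# Dictionary: no slice syntax, so pick the wanted keys with a comprehension.
record = {"AAA": 4, "BBB": 10, "CCC": 100}
subset = {k: record[k] for k in ("AAA", "CCC")}

# DataFrame: iloc slices by position, loc slices by label.
df = pd.DataFrame({"AAA": [4, 5, 6, 7],
                   "BBB": [10, 20, 30, 40],
                   "CCC": [100, 50, -30, -50]})
by_position = df.iloc[:, 1:2]        # still a DataFrame, holding only column BBB
by_label = df.loc[:, "BBB":"CCC"]    # label slices include both ends
one_cell = df.iloc[0, 2]             # a single value: row 0, column CCC

print(middle, last_two, subset)
print(by_position)
print(by_label)
print(one_cell)
```

Note the difference in end points: positional slices stop before the end, while label slices with loc include it.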

 
