How to start using Pandas immediately for Earth Data Analysis (code included)

This tutorial gives a brief introduction to scientific computing with Pandas. It covers Series, DataFrames, common operations and methods, conditional operations, groupby operations, and reading and writing data.

Pandas (short for "Panel Data") is an open-source Python library that comes with many tools that make data analysis severalfold easier. It is built on top of the NumPy library. For details on NumPy, please check my previous post:

In this tutorial, we will focus on the two main data structures Pandas provides - Series and DataFrames. We will also see how to perform groupby operations and read/write tabular data directly with Pandas.

Pandas Series

A Pandas Series is similar to a NumPy array, but it can have a "named index". Named indexes help a lot in data analysis.

Let’s create a series:

Create Pandas Series: Method 1 (detailed way)

import numpy as np
import pandas as pd

np.random.seed(0)
labels = ['x', 'y', 'z' ]
arr = np.random.randn(3)
print(arr)


This gives a numpy array:

[1.76405235 0.40015721 0.97873798]


Now, let’s create a dictionary using the above labels and array:

mydict = {lab:val for lab, val in zip(labels, arr)}
print(mydict)


=>

{'x': 1.764052345967664, 'y': 0.4001572083672233, 'z': 0.9787379841057392}


We can simply use this dictionary to create a Pandas Series:

myseries = pd.Series(data = mydict)
print(myseries)


=>

x    1.764052
y    0.400157
z    0.978738
dtype: float64


Create Pandas Series: Method 2

We can also create Pandas series directly:

myseries2 = pd.Series(data = arr, index = labels)
print(myseries2)


=>

x    1.764052
y    0.400157
z    0.978738
dtype: float64


Here, our data was all floats (and Pandas inferred the float64 dtype accordingly). But we can mix different data types together and Pandas will handle it just fine.
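As a quick sketch of that flexibility (with made-up values), a Series mixing an int, a float, and a string simply falls back to the generic object dtype:

```python
import pandas as pd

# Hypothetical mixed-type data: an int, a float, and a string.
# Pandas stores these under the generic "object" dtype.
mixed = pd.Series(data=[1, 2.5, "three"], index=["a", "b", "c"])
print(mixed)
print(mixed.dtype)
```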

Extract information from Pandas Series

We can use the index to extract the information from the series.

print(myseries2['x'])


=>

1.764052345967664


Operations on Pandas Series

Let us define two series and perform operations on them

labels1 = ['x', 'y', 'z' ]
arr1 = np.random.randn(3)
labels2 = ['xx', 'y', 'z' ]

arr2 = np.random.randn(3)
myseries1 = pd.Series(data = arr1, index = labels1)
myseries2 = pd.Series(data = arr2, index = labels2)

print("myseries1: \n",myseries1)
print("myseries2: \n",myseries2)


=>

myseries1:
x   -0.187184
y    1.532779
z    1.469359
dtype: float64

myseries2:
xx    0.154947
y     0.378163
z    -0.887786
dtype: float64


Now, let's sum the two series above:

myseries = myseries1 + myseries2
print(myseries)


=>

x          NaN
xx         NaN
y     1.910942
z     0.581573
dtype: float64


Notice that Pandas is quite forgiving. Even though there was no value for xx in myseries1 or for x in myseries2, Pandas did not throw an error but filled the missing values with NaN for us.
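If the NaN fill is not what you want, the Series add method accepts a fill_value that treats missing labels as a default instead. A minimal sketch with two small hypothetical series:

```python
import pandas as pd

s1 = pd.Series({"x": 1.0, "y": 2.0})
s2 = pd.Series({"y": 3.0, "z": 4.0})

# Plain "+" yields NaN for labels present in only one series...
print(s1 + s2)

# ...while add(fill_value=0) treats the missing label as 0 instead.
res = s1.add(s2, fill_value=0)
print(res)
```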

Pandas DataFrames

When we combine multiple Series that share a common index, we get a DataFrame.

Create DataFrame from random matrix

mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat)
print(mydf)


	0         1         2
0  0.955057  0.190794  1.978757
1  2.605967  0.683509  0.302665
2  1.693723 -1.706086 -1.159119
3 -0.134841  0.390528  0.166905
4  0.184502  0.807706  0.072960


Now, let us try to name the index and columns

mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf)


This will return

	C1        C2        C3
M  0.386030  2.084019 -0.376519
N  0.230336  0.681209  1.035125
X -0.031160  1.939932 -1.005187
Y -0.741790  0.187125 -0.732845
Z -1.382920  1.482495  0.961458


Extract Series from Pandas DataFrame

print(mydf['C1'])


M    0.386030
N    0.230336
X   -0.031160
Y   -0.741790
Z   -1.382920
Name: C1, dtype: float64

columnList = ['C1', 'C2']
print(mydf[columnList])


	C1        C2
M  0.386030  2.084019
N  0.230336  0.681209
X -0.031160  1.939932
Y -0.741790  0.187125
Z -1.382920  1.482495


Add new column to the existing Pandas Dataframe

Now, let us add a new column to the existing dataframe.

mydf['C12'] = mydf['C1'] + mydf['C2']
print(mydf)

	C1        C2        C3       C12
M  0.386030  2.084019 -0.376519  2.470049
N  0.230336  0.681209  1.035125  0.911546
X -0.031160  1.939932 -1.005187  1.908772
Y -0.741790  0.187125 -0.732845 -0.554665
Z -1.382920  1.482495  0.961458  0.099575


Remove the existing column or index from Pandas Dataframe

mydf.drop('C12', axis=1, inplace=True) #by default axis=0
print(mydf)

	C1        C2        C3
M  0.386030  2.084019 -0.376519
N  0.230336  0.681209  1.035125
X -0.031160  1.939932 -1.005187
Y -0.741790  0.187125 -0.732845
Z -1.382920  1.482495  0.961458


Two things to notice here. Pandas by default takes index names to drop (axis=0), and it does not permanently remove the index or column unless inplace=True is specified. If we want to remove a row, we provide its index name with axis=0 (the default).
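To make the copy-versus-in-place behavior concrete, here is a small sketch: without inplace=True, drop returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
mydf = pd.DataFrame(np.random.randn(5, 3),
                    index="M N X Y Z".split(),
                    columns="C1 C2 C3".split())

# Without inplace=True, drop returns a new DataFrame.
dropped = mydf.drop("Z")        # axis=0 (row labels) by default

print("Z" in mydf.index)        # original still has the row
print("Z" in dropped.index)     # the returned copy does not
```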

mydf.drop('Z', inplace=True) #by default axis=0
print(mydf)


	C1        C2        C3
M  0.386030  2.084019 -0.376519
N  0.230336  0.681209  1.035125
X -0.031160  1.939932 -1.005187
Y -0.741790  0.187125 -0.732845


loc and iloc method to extract values from the Pandas Dataframe

We can use loc or iloc method to extract the row from the DataFrame:

mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())

print(mydf.loc['M']) #uses the index label

print(mydf.iloc[0]) #uses the integer position

	C1        C2        C3
M -0.971393 -1.522333  1.133703
N  0.528187  0.393461 -0.630507
X -1.398290 -0.219311 -0.045676
Y  0.012421  0.093628  1.240813
Z -1.097693 -1.908009 -0.380104

C1   -0.971393
C2   -1.522333
C3    1.133703
Name: M, dtype: float64

C1   -0.971393
C2   -1.522333
C3    1.133703
Name: M, dtype: float64


We can also extract multiple rows by using a list of rows:

mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())

print(mydf.loc[['M', 'X']])

print(mydf.iloc[[0, 2]])

	C1        C2        C3
M -1.666059 -2.736995  1.522562
N  0.178009 -0.626805 -0.391089
X  1.743477  1.130018  0.897796
Y  0.330866 -1.063049 -0.125381
Z -0.945588  2.029544 -1.046358

C1        C2        C3
M -1.666059 -2.736995  1.522562
X  1.743477  1.130018  0.897796

C1        C2        C3
M -1.666059 -2.736995  1.522562
X  1.743477  1.130018  0.897796


How can we extract the columns?

mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())

print(mydf.loc[['M', 'N'],'C1'])

print(mydf.iloc[[0, 1],0])

	C1        C2        C3
M -1.549671  0.435253  1.259904
N -0.447898  0.266207  0.412580
X  0.988773  0.513833 -0.928205
Y  0.846904 -0.298436  0.029141
Z  0.889031 -1.839261  0.863596

M   -1.549671
N   -0.447898
Name: C1, dtype: float64

M   -1.549671
N   -0.447898
Name: C1, dtype: float64
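One detail worth remembering: loc also accepts label slices, which include both endpoints, whereas iloc uses Python-style positional slices that exclude the stop. A short sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
mydf = pd.DataFrame(np.random.randn(5, 3),
                    index="M N X Y Z".split(),
                    columns="C1 C2 C3".split())

# Label slices with loc include BOTH endpoints: rows M, N, X.
sub = mydf.loc["M":"X", "C1":"C2"]
print(sub)

# Positional slices with iloc exclude the stop: same selection.
print(mydf.iloc[0:3, 0:2])
```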


Conditional Operations on Pandas DataFrames

mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf)

	C1        C2        C3
M  1.035125 -0.031160  1.939932
N -1.005187 -0.741790  0.187125
X -0.732845 -1.382920  1.482495
Y  0.961458 -2.141212  0.992573
Z  1.192241 -1.046780  1.292765


Now, we want to find all the values in the DataFrame that are positive and mask all others. This can be easily done using:

mydf = mydf[mydf>0]
print(mydf)

	C1  C2        C3
M  1.035125 NaN  1.939932
N       NaN NaN  0.187125
X       NaN NaN  1.482495
Y  0.961458 NaN  0.992573
Z  1.192241 NaN  1.292765


We can also filter rows based on a condition on a single column; here we keep only the rows where C2 is positive:

mydf = mydf[mydf['C2']>0]
print(mydf)

	C1        C2        C3
M  0.649148  0.358941 -1.080471
N  0.902398  0.161781  0.833029
Y -0.708954  0.586847 -1.621348
Z  0.677535  0.026105 -1.678284


We can use the value_counts() method (available on any Series) to count how many rows satisfy C2 > 0. There are many other ways to retrieve such information.
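For instance, a boolean Series built from the condition can be counted directly (a quick sketch, regenerating the same seed-0 DataFrame):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
mydf = pd.DataFrame(np.random.randn(5, 3),
                    index="M N X Y Z".split(),
                    columns="C1 C2 C3".split())

# value_counts tallies True (C2 > 0) vs False (C2 <= 0)...
counts = (mydf["C2"] > 0).value_counts()
print(counts)

# ...and summing the boolean Series gives the positive count directly.
print((mydf["C2"] > 0).sum())
```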

We can also use more than one condition. Please note that in this case I regenerated mydf using the NumPy seed of 0.

np.random.seed(0)
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())

  C1        C2        C3
M  1.764052  0.400157  0.978738
N  2.240893  1.867558 -0.977278
X  0.950088 -0.151357 -0.103219
Y  0.410599  0.144044  1.454274
Z  0.761038  0.121675  0.443863

mydf2 = mydf[(mydf['C1']>0) & (mydf['C2']>1)]
print(mydf2)

   C1        C2        C3
N  2.240893  1.867558 -0.977278

Note that Pandas conditional operations use "&" (and) and "|" (or) instead of Python's and/or keywords, and each condition must be wrapped in parentheses.
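For completeness, here is a sketch of an OR condition using "|" on the same seed-0 DataFrame:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
mydf = pd.DataFrame(np.random.randn(5, 3),
                    index="M N X Y Z".split(),
                    columns="C1 C2 C3".split())

# "|" keeps rows matching EITHER condition;
# each condition needs its own parentheses.
res = mydf[(mydf["C1"] > 2) | (mydf["C3"] > 1)]
print(res)
```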

Create new index for the dataframe

mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf)

	C1        C2        C3
M  1.764052  0.400157  0.978738
N  2.240893  1.867558 -0.977278
X  0.950088 -0.151357 -0.103219
Y  0.410599  0.144044  1.454274
Z  0.761038  0.121675  0.443863


Now, we create a new column with new index values

my_new_idx = "MM NN XX YY ZZ".split()
mydf['newidx'] = my_new_idx
print(mydf)

	C1        C2        C3 newidx
M  1.764052  0.400157  0.978738     MM
N  2.240893  1.867558 -0.977278     NN
X  0.950088 -0.151357 -0.103219     XX
Y  0.410599  0.144044  1.454274     YY
Z  0.761038  0.121675  0.443863     ZZ


Let’s make the column “newidx” as the index (we will lose our original index):

mydf.set_index('newidx', inplace=True)
print(mydf)

newidx      C1        C2        C3
MM      1.764052  0.400157  0.978738
NN      2.240893  1.867558 -0.977278
XX      0.950088 -0.151357 -0.103219
YY      0.410599  0.144044  1.454274
ZZ      0.761038  0.121675  0.443863
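The reverse operation is reset_index, which moves the current index back into a regular column and restores the default integer index. A brief sketch on a fresh seed-0 DataFrame:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
mydf = pd.DataFrame(np.random.randn(5, 3),
                    index="M N X Y Z".split(),
                    columns="C1 C2 C3".split())

# reset_index turns the M..Z labels into a column named "index"
# and replaces the index with the default 0..n-1 integers.
mydf.reset_index(inplace=True)
print(mydf)
```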


Pandas Common Methods and attributes

df.info()

mydf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, M to Z
Data columns (total 3 columns):
#   Column  Non-Null Count  Dtype
---  ------  --------------  -----
0   C1      5 non-null      float64
1   C2      5 non-null      float64
2   C3      5 non-null      float64
dtypes: float64(3)
memory usage: 160.0+ bytes


df.dtypes

mydf.dtypes

C1    float64
C2    float64
C3    float64
dtype: object


df.describe()

mydf.describe()

	C1	  C2	     C3
count	5.000000	5.000000	5.000000
mean	1.225334	0.476415	0.359276
std	0.754437	0.801795	0.947389
min	0.410599	-0.151357	-0.977278
25%	0.761038	0.121675	-0.103219
50%	0.950088	0.144044	0.443863
75%	1.764052	0.400157	0.978738
max	2.240893	1.867558	1.454274


The apply() method

np.random.seed(0)
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf)

def square(number):
    return number**2

print(mydf['C1'].apply(square))

C1        C2        C3
M  1.764052  0.400157  0.978738
N  2.240893  1.867558 -0.977278
X  0.950088 -0.151357 -0.103219
Y  0.410599  0.144044  1.454274
Z  0.761038  0.121675  0.443863

M    3.111881
N    5.021602
X    0.902668
Y    0.168591
Z    0.579178
Name: C1, dtype: float64
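For one-off transforms, a lambda saves defining a named function, and apply on a whole DataFrame works column-wise. A short sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
mydf = pd.DataFrame(np.random.randn(5, 3),
                    index="M N X Y Z".split(),
                    columns="C1 C2 C3".split())

# A lambda replaces a one-off named function.
sq = mydf["C1"].apply(lambda x: x**2)
print(sq)

# On a DataFrame, apply runs the function once per column.
rng = mydf.apply(lambda col: col.max() - col.min())
print(rng)
```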


Sort DataFrame by columns

np.random.seed(0)
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())

print(mydf)
print(mydf.sort_values(by='C2'))

C1        C2        C3
M  1.764052  0.400157  0.978738
N  2.240893  1.867558 -0.977278
X  0.950088 -0.151357 -0.103219
Y  0.410599  0.144044  1.454274
Z  0.761038  0.121675  0.443863

C1        C2        C3
X  0.950088 -0.151357 -0.103219
Z  0.761038  0.121675  0.443863
Y  0.410599  0.144044  1.454274
M  1.764052  0.400157  0.978738
N  2.240893  1.867558 -0.977278
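sort_values also takes ascending=False for descending order, and a list of columns for breaking ties by later columns. A brief sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
mydf = pd.DataFrame(np.random.randn(5, 3),
                    index="M N X Y Z".split(),
                    columns="C1 C2 C3".split())

# Descending sort on C2.
desc = mydf.sort_values(by="C2", ascending=False)
print(desc)

# Sort by C1 first, then C2 for any ties.
print(mydf.sort_values(by=["C1", "C2"]))
```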


Pandas Groupby Operations

Sometimes in a Pandas DataFrame we want to aggregate rows based on some criterion. The groupby method does this elegantly by performing "split", "apply", and "combine" operations under the hood. It will become clearer with some examples.

np.random.seed(0)
mymat = np.random.randn(10,3)
mydf = pd.DataFrame(data=mymat, columns="C1 C2 C3".split())
mydf['names'] = "M N X Y Z".split() * 2
print(mydf)

	C1        C2        C3   names
0  1.764052  0.400157  0.978738     M
1  2.240893  1.867558 -0.977278     N
2  0.950088 -0.151357 -0.103219     X
3  0.410599  0.144044  1.454274     Y
4  0.761038  0.121675  0.443863     Z
5  0.333674  1.494079 -0.205158     M
6  0.313068 -0.854096 -2.552990     N
7  0.653619  0.864436 -0.742165     X
8  2.269755 -1.454366  0.045759     Y
9 -0.187184  1.532779  1.469359     Z

print(mydf.groupby('names').mean())

names        C1        C2        C3
M      1.048863  0.947118  0.386790
N      1.276980  0.506731 -1.765134
X      0.801854  0.356539 -0.422692
Y      1.340177 -0.655161  0.750016
Z      0.286927  0.827227  0.956611


We can replace the “mean” method with any other aggregate method such as sum, max, std, etc.
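If you want several aggregates at once, the agg method applies them in one pass (a short sketch on the same grouped DataFrame):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
mymat = np.random.randn(10, 3)
mydf = pd.DataFrame(data=mymat, columns="C1 C2 C3".split())
mydf["names"] = "M N X Y Z".split() * 2

# agg applies several aggregates at once; the result has one
# sub-column per aggregate under each original column.
stats = mydf.groupby("names").agg(["mean", "std"])
print(stats)

# Any single aggregate (sum, max, std, ...) works the same as mean.
sums = mydf.groupby("names")["C1"].sum()
print(sums)
```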

Another example with the “describe” method:

print(mydf.groupby('names').describe().transpose())

names            M         N         X         Y         Z
C1 count  2.000000  2.000000  2.000000  2.000000  2.000000
mean   1.048863  1.276980  0.801854  1.340177  0.286927
std    1.011430  1.363178  0.209636  1.314622  0.670494
min    0.333674  0.313068  0.653619  0.410599 -0.187184
25%    0.691269  0.795024  0.727736  0.875388  0.049872
50%    1.048863  1.276980  0.801854  1.340177  0.286927
75%    1.406458  1.758937  0.875971  1.804966  0.523982
max    1.764052  2.240893  0.950088  2.269755  0.761038

C2 count  2.000000  2.000000  2.000000  2.000000  2.000000
mean   0.947118  0.506731  0.356539 -0.655161  0.827227
std    0.773520  1.924500  0.718274  1.130246  0.997801
min    0.400157 -0.854096 -0.151357 -1.454366  0.121675
25%    0.673638 -0.173682  0.102591 -1.054763  0.474451
50%    0.947118  0.506731  0.356539 -0.655161  0.827227
75%    1.220599  1.187145  0.610488 -0.255559  1.180003
max    1.494079  1.867558  0.864436  0.144044  1.532779

C3 count  2.000000  2.000000  2.000000  2.000000  2.000000
mean   0.386790 -1.765134 -0.422692  0.750016  0.956611
std    0.837141  1.114197  0.451803  0.995971  0.725135
min   -0.205158 -2.552990 -0.742165  0.045759  0.443863
25%    0.090816 -2.159062 -0.582428  0.397887  0.700237
50%    0.386790 -1.765134 -0.422692  0.750016  0.956611
75%    0.682764 -1.371206 -0.262955  1.102145  1.212985
max    0.978738 -0.977278 -0.103219  1.454274  1.469359


Read and Write Data with Pandas

For details on how to read and write tabular data with the help of Pandas, please refer to the Pandas documentation here.

The most common ones you would like to familiarize yourself with are read_csv and to_csv.

Let us start by writing our dataframe into a file myexample.csv:

np.random.seed(0)
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())

mydf.to_csv('myexample.csv', index=False)

cat myexample.csv

C1,C2,C3
1.764052345967664,0.4001572083672233,0.9787379841057392
2.240893199201458,1.8675579901499675,-0.977277879876411
0.9500884175255894,-0.1513572082976979,-0.10321885179355784
0.41059850193837233,0.144043571160878,1.454273506962975
0.7610377251469934,0.12167501649282841,0.44386323274542566


Now, let's read that file back using read_csv:

mydf2 = pd.read_csv('myexample.csv')
print(mydf2)

	C1        C2        C3
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
2  0.950088 -0.151357 -0.103219
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863
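Because we wrote the file with index=False, the labels M through Z were lost, and read_csv fell back to a default integer index. To round-trip the index, write it out (the default) and point index_col at the first column when reading back. A sketch using a hypothetical filename myexample2.csv:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
mydf = pd.DataFrame(np.random.randn(5, 3),
                    index="M N X Y Z".split(),
                    columns="C1 C2 C3".split())

# Writing WITHOUT index=False stores the labels as the first column...
mydf.to_csv("myexample2.csv")

# ...and index_col=0 restores them as the index on the way back in.
mydf2 = pd.read_csv("myexample2.csv", index_col=0)
print(mydf2)
```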

