How to start using Pandas immediately for Earth Data Analysis (codes included)

Utpal Kumar 9 minute read UTILITIES April 14, 2021

This tutorial gives a brief description of scientific computing using Pandas by introducing Series, DataFrame, Pandas common operations, methods, conditional operations, groupby operations, and reading and writing data

Pandas (stands for Panel Data library) is an open-source library for Python that comes with many tools that make the task of data analysis several folds easier. It is built upon the Numpy Library in Python. For details on Numpy, please check my previous post:

In this tutorial, we will focus on two main data structures Pandas has - Series and DataFrames. We will see how we can perform groupby operations and read/write tabular data directly using Pandas.

Pandas Series

Pandas series is similar to numpy array but it can have ”named index”. Naming indexes helps a lot in data analysis.

Pandas DataFrames

When we combine multiple series with common index, then we get DataFrame.

Create `DataFrame` from random matrix

mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat)
print(mydf)

⇒

       1         2
0.955057  0.190794  1.978757
2.605967  0.683509  0.302665
1.693723 -1.706086 -1.159119
-0.134841  0.390528  0.166905
0.184502  0.807706  0.072960

Now, let us try to name the index and columns

mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf)

This will return

	C1        C2        C3
M  0.386030  2.084019 -0.376519
N  0.230336  0.681209  1.035125
X -0.031160  1.939932 -1.005187
Y -0.741790  0.187125 -0.732845
Z -1.382920  1.482495  0.961458

Extract Series from Pandas DataFrame

print(mydf['C1'])

⇒

M    0.386030
N    0.230336
X   -0.031160
Y   -0.741790
Z   -1.382920
Name: C1, dtype: float64

columnList = ['C1', 'C2']
print(mydf[columnList])

⇒

	C1        C2
M  0.386030  2.084019
N  0.230336  0.681209
X -0.031160  1.939932
Y -0.741790  0.187125
Z -1.382920  1.482495

Add new column to the existing Pandas Dataframe

Now, let us add a new column to the existing dataframe.

mydf['C12'] = mydf['C1'] + mydf['C2']
print(mydf)

	C1        C2        C3       C12
M  0.386030  2.084019 -0.376519  2.470049
N  0.230336  0.681209  1.035125  0.911546
X -0.031160  1.939932 -1.005187  1.908772
Y -0.741790  0.187125 -0.732845 -0.554665
Z -1.382920  1.482495  0.961458  0.099575

Remove the existing column or index from Pandas Dataframe

mydf.drop('C12', axis=1, inplace=True) #by default axis=0
print(mydf.head())

	C1        C2        C3
M  0.386030  2.084019 -0.376519
N  0.230336  0.681209  1.035125
X -0.031160  1.939932 -1.005187
Y -0.741790  0.187125 -0.732845
Z -1.382920  1.482495  0.961458

Two things to notice here. Pandas by default take the index names to drop and it does not permanently remove the index or column unless specified. If we want to remove the index then we provide axis=0 with the index name.

mydf.drop('Z', inplace=True) #by default axis=0
print(mydf.head())

⇒

	C1        C2        C3
M  0.386030  2.084019 -0.376519
N  0.230336  0.681209  1.035125
X -0.031160  1.939932 -1.005187
Y -0.741790  0.187125 -0.732845

`loc` and `iloc` method to extract values from the Pandas Dataframe

We can use loc or iloc method to extract the row from the DataFrame:

mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())

print(mydf.loc['M']) #uses the column name

print(mydf.iloc[0]) #uses index

	C1        C2        C3
M -0.971393 -1.522333  1.133703
N  0.528187  0.393461 -0.630507
X -1.398290 -0.219311 -0.045676
Y  0.012421  0.093628  1.240813
Z -1.097693 -1.908009 -0.380104

C1   -0.971393
C2   -1.522333
C3    1.133703
Name: M, dtype: float64

C1   -0.971393
C2   -1.522333
C3    1.133703
Name: M, dtype: float64

We can also extract multiple rows by using a list of rows:

mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())

print(mydf.loc[['M', 'X']])

print(mydf.iloc[[0, 2]])

	C1        C2        C3
M -1.666059 -2.736995  1.522562
N  0.178009 -0.626805 -0.391089
X  1.743477  1.130018  0.897796
Y  0.330866 -1.063049 -0.125381
Z -0.945588  2.029544 -1.046358

  C1        C2        C3
M -1.666059 -2.736995  1.522562
X  1.743477  1.130018  0.897796

  C1        C2        C3
M -1.666059 -2.736995  1.522562
X  1.743477  1.130018  0.897796

How can we extract the columns?

mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())

print(mydf.loc[['M', 'N'],'C1'])

print(mydf.iloc[[0, 1],0])

	C1        C2        C3
M -1.549671  0.435253  1.259904
N -0.447898  0.266207  0.412580
X  0.988773  0.513833 -0.928205
Y  0.846904 -0.298436  0.029141
Z  0.889031 -1.839261  0.863596

M   -1.549671
N   -0.447898
Name: C1, dtype: float64

M   -1.549671
N   -0.447898
Name: C1, dtype: float64

Conditional Operations on Pandas DataFrames

mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())

	C1        C2        C3
M  1.035125 -0.031160  1.939932
N -1.005187 -0.741790  0.187125
X -0.732845 -1.382920  1.482495
Y  0.961458 -2.141212  0.992573
Z  1.192241 -1.046780  1.292765

Now, we want to find all the values in the DataFrame that are positive and mask all others. This can be easily done using:

mydf = mydf[mydf>0]
print(mydf.head())

	C1  C2        C3
M  1.035125 NaN  1.939932
N       NaN NaN  0.187125
X       NaN NaN  1.482495
Y  0.961458 NaN  0.992573
Z  1.192241 NaN  1.292765

We can do the same thing for just a particular column:

mydf = mydf[mydf['C2']>0]
print(mydf.head())

	C1        C2        C3
M  0.649148  0.358941 -1.080471
N  0.902398  0.161781  0.833029
Y -0.708954  0.586847 -1.621348
Z  0.677535  0.026105 -1.678284

We can use the value_counts() method (for series) for counting the number of C2>0. There are many other ways we can retrieve such information.

We can also use more than one condition. Please note that in this case I regenerated the mydf using the numpy seed of 0.

np.random.seed(0)
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())

  C1        C2        C3
M  1.764052  0.400157  0.978738
N  2.240893  1.867558 -0.977278
X  0.950088 -0.151357 -0.103219
Y  0.410599  0.144044  1.454274
Z  0.761038  0.121675  0.443863

mydf2 = mydf[(mydf['C1']>0) & (mydf['C2']>1)]
print(mydf2.head())

			C1        C2        C3
N  2.240893  1.867558 -0.977278

Pandas conditioning operations uses “&” and “

”.

Create new index for the dataframe

mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())

	C1        C2        C3
M  1.764052  0.400157  0.978738
N  2.240893  1.867558 -0.977278
X  0.950088 -0.151357 -0.103219
Y  0.410599  0.144044  1.454274
Z  0.761038  0.121675  0.443863

Now, we create a new column with new index values

my_new_idx = "MM NN XX YY ZZ".split()
mydf['newidx'] = my_new_idx
print(mydf.head())

	C1        C2        C3 newidx
M  1.764052  0.400157  0.978738     MM
N  2.240893  1.867558 -0.977278     NN
X  0.950088 -0.151357 -0.103219     XX
Y  0.410599  0.144044  1.454274     YY
Z  0.761038  0.121675  0.443863     ZZ

Let’s make the column “newidx” as the index (we will lose our original index):

mydf.set_index('newidx', inplace=True)
print(mydf.head())

newidx      C1        C2        C3                              
MM      1.764052  0.400157  0.978738
NN      2.240893  1.867558 -0.977278
XX      0.950088 -0.151357 -0.103219
YY      0.410599  0.144044  1.454274
ZZ      0.761038  0.121675  0.443863

Pandas Common Methods and attributes

`df.info`

mydf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, M to Z
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   C1      5 non-null      float64
 1   C2      5 non-null      float64
 2   C3      5 non-null      float64
dtypes: float64(3)
memory usage: 160.0+ bytes

`df.dtypes`

mydf.dtypes

C1    float64
C2    float64
C3    float64
dtype: object

`df.describe()`

mydf.describe()

	C1	  C2	     C3
count	5.000000	5.000000	5.000000
mean	1.225334	0.476415	0.359276
std	0.754437	0.801795	0.947389
min	0.410599	-0.151357	-0.977278
25%	0.761038	0.121675	-0.103219
50%	0.950088	0.144044	0.443863
75%	1.764052	0.400157	0.978738
max	2.240893	1.867558	1.454274

Apply (`apply`) method

np.random.seed(0)
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())

def square(number):
    return number**2

print(mydf['C1'].apply(square))

C1        C2        C3
M  1.764052  0.400157  0.978738
N  2.240893  1.867558 -0.977278
X  0.950088 -0.151357 -0.103219
Y  0.410599  0.144044  1.454274
Z  0.761038  0.121675  0.443863

M    3.111881
N    5.021602
X    0.902668
Y    0.168591
Z    0.579178
Name: C1, dtype: float64

Sort DataFrame by columns

np.random.seed(0)
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())
print(mydf.head())

print(mydf.sort_values(by='C2'))

C1        C2        C3
M  1.764052  0.400157  0.978738
N  2.240893  1.867558 -0.977278
X  0.950088 -0.151357 -0.103219
Y  0.410599  0.144044  1.454274
Z  0.761038  0.121675  0.443863

  C1        C2        C3
X  0.950088 -0.151357 -0.103219
Z  0.761038  0.121675  0.443863
Y  0.410599  0.144044  1.454274
M  1.764052  0.400157  0.978738
N  2.240893  1.867558 -0.977278

Pandas Groupby Operations

Sometimes in Pandas dataframe we want to combine some columns based on some criteria. Groupby method can do this task elegantly by performing “split”, “apply”, and “combine” operations under the hood. It will be more clear with some examples.

np.random.seed(0)
mymat = np.random.randn(10,3)
mydf = pd.DataFrame(data=mymat, columns="C1 C2 C3".split())
mydf['names'] = "M N X Y Z".split() * 2
print(mydf)

	C1        C2        C3   names
1.764052  0.400157  0.978738     M
2.240893  1.867558 -0.977278     N
0.950088 -0.151357 -0.103219     X
0.410599  0.144044  1.454274     Y
0.761038  0.121675  0.443863     Z
0.333674  1.494079 -0.205158     M
0.313068 -0.854096 -2.552990     N
0.653619  0.864436 -0.742165     X
2.269755 -1.454366  0.045759     Y
-0.187184  1.532779  1.469359     Z

mydf.groupby('names').mean()

names        C1        C2        C3                      
M      1.048863  0.947118  0.386790
N      1.276980  0.506731 -1.765134
X      0.801854  0.356539 -0.422692
Y      1.340177 -0.655161  0.750016
Z      0.286927  0.827227  0.956611

We can replace the “mean” method with any other aggregate method such as sum, max, std, etc.

Another example with the “describe” method:

print(mydf.groupby('names').describe().transpose())

names            M         N         X         Y         Z
C1 count  2.000000  2.000000  2.000000  2.000000  2.000000
   mean   1.048863  1.276980  0.801854  1.340177  0.286927
   std    1.011430  1.363178  0.209636  1.314622  0.670494
   min    0.333674  0.313068  0.653619  0.410599 -0.187184
   25%    0.691269  0.795024  0.727736  0.875388  0.049872
   50%    1.048863  1.276980  0.801854  1.340177  0.286927
   75%    1.406458  1.758937  0.875971  1.804966  0.523982
   max    1.764052  2.240893  0.950088  2.269755  0.761038

C2 count  2.000000  2.000000  2.000000  2.000000  2.000000
   mean   0.947118  0.506731  0.356539 -0.655161  0.827227
   std    0.773520  1.924500  0.718274  1.130246  0.997801
   min    0.400157 -0.854096 -0.151357 -1.454366  0.121675
   25%    0.673638 -0.173682  0.102591 -1.054763  0.474451
   50%    0.947118  0.506731  0.356539 -0.655161  0.827227
   75%    1.220599  1.187145  0.610488 -0.255559  1.180003
   max    1.494079  1.867558  0.864436  0.144044  1.532779

C3 count  2.000000  2.000000  2.000000  2.000000  2.000000
   mean   0.386790 -1.765134 -0.422692  0.750016  0.956611
   std    0.837141  1.114197  0.451803  0.995971  0.725135
   min   -0.205158 -2.552990 -0.742165  0.045759  0.443863
   25%    0.090816 -2.159062 -0.582428  0.397887  0.700237
   50%    0.386790 -1.765134 -0.422692  0.750016  0.956611
   75%    0.682764 -1.371206 -0.262955  1.102145  1.212985
   max    0.978738 -0.977278 -0.103219  1.454274  1.469359

Read and Write Data with Pandas

For details on how to read and write tabular data with the help of Pandas, please refer to the Pandas documentation here.

The most common ones you would like to familiarize yourself are - read_csv, to_csv,

Let us start by writing our dataframe into a file myexample.csv

np.random.seed(0)
mymat = np.random.randn(5,3)
mydf = pd.DataFrame(data=mymat, index="M N X Y Z".split(), columns="C1 C2 C3".split())

mydf.to_csv('myexample.csv', index=False)

cat myexample.csv

C1,C2,C3
764052345967664,0.4001572083672233,0.9787379841057392
240893199201458,1.8675579901499675,-0.977277879876411
9500884175255894,-0.1513572082976979,-0.10321885179355784
41059850193837233,0.144043571160878,1.454273506962975
7610377251469934,0.12167501649282841,0.44386323274542566

Now, lets read that file using the read_csv

mydf2 = pd.read_csv('myexample.csv')
print(mydf2.head())

	C1        C2        C3
1.764052  0.400157  0.978738
2.240893  1.867558 -0.977278
0.950088 -0.151357 -0.103219
0.410599  0.144044  1.454274
0.761038  0.121675  0.443863

Disclaimer of liability

The information provided by the Earth Inversion is made available for educational purposes only.

Whilst we endeavor to keep the information up-to-date and correct. Earth Inversion makes no representations or warranties of any kind, express or implied about the completeness, accuracy, reliability, suitability or availability with respect to the website or the information, products, services or related graphics content on the website for any purpose.

UNDER NO CIRCUMSTANCE SHALL WE HAVE ANY LIABILITY TO YOU FOR ANY LOSS OR DAMAGE OF ANY KIND INCURRED AS A RESULT OF THE USE OF THE SITE OR RELIANCE ON ANY INFORMATION PROVIDED ON THE SITE. ANY RELIANCE YOU PLACED ON SUCH MATERIAL IS THEREFORE STRICTLY AT YOUR OWN RISK.

How to start using Pandas immediately for Earth Data Analysis (codes included)

Pandas Series

Similar posts

Understanding the metropolis-hastings algorithm: a step-by-step guide with python examples

Exploring the multi-taper method: a robust approach for time-frequency analysis

What is pink noise and how is it different from white noise?

Object oriented programming (oop) with examples in python

Using mpi4py for parallel computing in python on supercomputers

Create Pandas Series: Method 1 (detailed way)

Create Pandas Series: Method 2

Extract information from Pandas Series

Operations on Pandas Series

Pandas DataFrames

Create `DataFrame` from random matrix

Extract Series from Pandas DataFrame

Add new column to the existing Pandas Dataframe

Remove the existing column or index from Pandas Dataframe

`loc` and `iloc` method to extract values from the Pandas Dataframe

Conditional Operations on Pandas DataFrames

Create new index for the dataframe

Pandas Common Methods and attributes

`df.info`

`df.dtypes`

`df.describe()`

Apply (`apply`) method

Sort DataFrame by columns

Pandas Groupby Operations

Read and Write Data with Pandas

Understanding the metropolis-hastings algorithm: a step-by-step guide with python examples

Exploring the multi-taper method: a robust approach for time-frequency analysis

What is pink noise and how is it different from white noise?

Object oriented programming (oop) with examples in python

Disclaimer of liability

Leave a comment

Subscribe to our weekly newsletter

Pandas Series

Similar posts

Create Pandas Series: Method 1 (detailed way)

Create Pandas Series: Method 2

Extract information from Pandas Series

Operations on Pandas Series

Pandas DataFrames

Create DataFrame from random matrix

Extract Series from Pandas DataFrame

Add new column to the existing Pandas Dataframe

Remove the existing column or index from Pandas Dataframe

loc and iloc method to extract values from the Pandas Dataframe

Conditional Operations on Pandas DataFrames

Create new index for the dataframe

Pandas Common Methods and attributes

df.info

df.dtypes

df.describe()

Apply (apply) method

Sort DataFrame by columns

Pandas Groupby Operations

Read and Write Data with Pandas

Disclaimer of liability

Leave a comment

Create `DataFrame` from random matrix

`loc` and `iloc` method to extract values from the Pandas Dataframe

`df.info`

`df.dtypes`

`df.describe()`

Apply (`apply`) method