How to extract station data from a web page

Utpal Kumar   1 minute read      

Use pandas to read an HTML page and extract its table data into a pandas DataFrame

In this post, we will use Python to extract station information from a web page. This saves a lot of time compared with copying the data manually. We will read the data into a pandas DataFrame and then save it to a CSV file.

Extract table data into a pandas DataFrame

import pandas as pd

url = "https://bats.earth.sinica.edu.tw/Station/BATS_Stn_Summary.html"

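# read_html returns a list with one DataFrame per <table> found on the page
# (it needs an HTML parser installed: lxml, or bs4 together with html5lib)
htmlPage = pd.read_html(url)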

# print(htmlPage)
print(f"total # of tables {len(htmlPage)}")


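# the station table is the second table on the page
df = htmlPage[1]
# the second row of that table holds the column names
columns = df.iloc[1, :].values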
print(columns)

# build one dict per station, skipping the first two and the last three rows
dict_list = []
for idx in range(2, df.shape[0]-3):
    _dict = {}
    for icol, col in enumerate(columns):
        _dict[col] = df.iloc[idx, icol]
    dict_list.append(_dict)

new_df = pd.DataFrame(dict_list)
print(new_df.head())

# save as comma-separated text (the plotting script reads this file back)
new_df.to_csv('station_list.txt', index=False)
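
The row-by-row loop can also be expressed with slicing. The following is a minimal sketch on a small made-up table, not on the real station page: the station names and coordinates are invented for illustration, and the layout (a title row, a header row, the data rows, then three trailing rows) is an assumption inferred from the indices used in the script.

```python
import pandas as pd

# Hypothetical raw table mimicking the assumed layout: row 0 is a title,
# row 1 holds the column names, and the last three rows are trailing notes.
raw = pd.DataFrame([
    ["Station summary", None, None],
    ["Station", "Lat.", "Long."],
    ["AAA", "23.5", "121.0"],
    ["BBB", "24.1", "121.6"],
    ["note 1", None, None],
    ["note 2", None, None],
    ["note 3", None, None],
])

columns = raw.iloc[1, :].values   # header row
clean = raw.iloc[2:-3].copy()     # same rows as range(2, raw.shape[0] - 3)
clean.columns = columns
clean = clean.reset_index(drop=True)

# Cells are still strings at this point; convert the coordinates to numbers.
for col in ["Lat.", "Long."]:
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

print(clean)
```

Converting the coordinate columns explicitly is optional when the table is written to a file and read back with `pd.read_csv`, since the reader infers numeric types on its own.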

Plot stations

import numpy as np
import pygmt
import pandas as pd
np.random.seed(45)  # to get the same color at each run

df = pd.read_csv('station_list.txt')
print(df.head())

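# get the sorted list of networks; a set would not keep the same order
# between runs, so the seeded colors could end up on different networks
networks = sorted(df['Network'].dropna().unique())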

# one sub-dataframe per network
dfs = [df[df['Network'] == net] for net in networks]

# one random hex color per network
colorsList = ['#%06X' % np.random.randint(0, 0xFFFFFF) for _ in networks]


minlon, maxlon = df['Long.'].min()-1, df['Long.'].max()+1
minlat, maxlat = df['Lat.'].min()-1, df['Lat.'].max()+1

# remote Earth relief grid at 15 arc-second resolution
# (GMT downloads and caches it on first use)
topo_data = "@earth_relief_15s"

# Visualization
fig = pygmt.Figure()
# make the color palette
pygmt.makecpt(
    cmap='etopo1',
    series='-8000/5000/1000',
    continuous=True
)

# plot high res topography
fig.grdimage(
    grid=topo_data,
    region=[minlon, maxlon, minlat, maxlat],
    projection='M4i',
    shading=True,
    frame=True
)

# plot coastlines
fig.coast(
    region=[minlon, maxlon, minlat, maxlat],
    projection='M4i',
    shorelines=True,
    frame=True
)
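leftjustify, rightoffset = "TL", "5p/-5p"
# plot the stations of each network as inverted triangles with their own color
for idx, dff in enumerate(dfs):
    fig.plot(
        x=dff["Long."].values,
        y=dff["Lat."].values,
        style="i10p",
        # "fill" replaces the "color" parameter of older PyGMT releases
        fill=colorsList[idx],
        pen="black",
        label=networks[idx]
    )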

# label each station with its name
for snum in range(df.shape[0]):
    fig.text(
        x=df.loc[snum, 'Long.'],
        y=df.loc[snum, 'Lat.'],
        text=str(df.loc[snum, 'Station']),
        justify=leftjustify,
        angle=0,
        offset=rightoffset,
        fill="white",
        font="6p,Helvetica-Bold,black",
    )


fig.legend(position="JTR+jTR+o0.2c", box=True)

fig.savefig('station_map.png', crop=True, dpi=300)

Extracted stations from the HTML page
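
Seeded random colors depend on the order in which the networks are listed, so a color can move to a different network if that order changes. As one possible alternative (a sketch, not the method used in the script), the color can be derived from the network name itself, and `groupby` can stand in for the manual filtering loop. The station rows and the helper name `network_color` below are made up for illustration.

```python
import hashlib
import pandas as pd


def network_color(name):
    """Hypothetical helper: map a network name to a stable hex color."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return "#" + digest[:6].upper()


# Made-up stations, for illustration only
df = pd.DataFrame({
    "Station": ["AAA", "BBB", "CCC"],
    "Network": ["N1", "N2", "N1"],
    "Lat.": [23.5, 24.1, 22.9],
    "Long.": [121.0, 121.6, 120.4],
})

# One sub-dataframe per network; groupby yields the keys in sorted order
groups = {net: sub for net, sub in df.groupby("Network")}
colors = {net: network_color(net) for net in groups}

# Map region padded by one degree on each side
pad = 1.0
region = [df["Long."].min() - pad, df["Long."].max() + pad,
          df["Lat."].min() - pad, df["Lat."].max() + pad]

for net, sub in groups.items():
    print(net, colors[net], sub["Station"].tolist())
print(region)
```

Each `colors[net]` string could then be passed to `fig.plot` in place of the random color, and `region` to `fig.grdimage` and `fig.coast`. The trade-off is that hashed colors are reproducible but not chosen for contrast, so two networks may occasionally get similar shades.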
