Pandas: Cute and fluffy code!


Sometimes, when we have a lot of data, or we want to import some data from somewhere else to use it ourselves, lists and dictionaries just don't cut it.

They look clunky and ugly. I think we should learn something that can make them look better and also allow us to do fun stuff with them, like plotting them on graphs and such!

Pandas is a Python library that lets us do this. To start with, we'll install the pandas library into our Python installation so we can play around with it:

pip install pandas

Now we have it! If we open up a new file, we can start the file off by importing our library:

import pandas as pd

You don't have to import it as pd, but most tutorials you'll see elsewhere on the internet refer to pandas as pd, so it's handy for us to do the same.

So we have pandas, and we've imported it. Now we need some data to play around with. You can copy and paste this code to save you from having to type it yourself, so we can take a look at it.

import pandas as pd

# create a dictionary of data
data = {'name': ['John', 'Jane', 'Jim', 'Joan'],
        'age': [32, 28, 41, 35],
        'country': ['USA', 'UK', 'Canada', 'Australia']}

# create a pandas data frame from the dictionary
df = pd.DataFrame(data)

# view the data frame
print(df)

So, we create a dictionary of data, which should be fairly easy to picture: each key is a column name, and each list holds the values for that column.

We then create what's called a DataFrame. This is how pandas holds onto the data. It's like a little box the panda uses to hold all our data.

Our final command just prints the DataFrame out for us.

   name  age    country
0  John   32        USA
1  Jane   28         UK
2   Jim   41     Canada
3  Joan   35  Australia

You'll notice how pandas makes it look nice and tidy. It has also added an unnamed first column of numbers for us. This is the index, which is how pandas labels each row, and it's how we will refer to rows in future if we need to work with them by index.
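To make the index idea a bit more concrete, here is a small sketch using the same data. It only assumes the DataFrame built above: loc looks a row up by its index label, iloc looks it up by its position, and square brackets pick out a column.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Jim', 'Joan'],
        'age': [32, 28, 41, 35],
        'country': ['USA', 'UK', 'Canada', 'Australia']}
df = pd.DataFrame(data)

# pick one column by its name; this gives back a pandas Series
ages = df['age']

# pick one row by its index label (the numbers in the first column)
jim = df.loc[2]

# pick one row by its position, counting from 0
first_row = df.iloc[0]

print(ages.tolist())      # [32, 28, 41, 35]
print(jim['name'])        # Jim
print(first_row['name'])  # John
```

Right now the label and the position of each row are the same number. They stop matching as soon as the rows get reordered, which is when the difference between loc and iloc starts to matter.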

What can we do with the data?

One thing we can do is sort it by age, if we wanted to for any reason. The column name has to match exactly, including the lowercase 'a'. The way to do this would be:

df.sort_values(by='age', ascending=True, inplace=True)

Pretty simple. We could do this in descending order too, by passing ascending=False.

You'll notice the argument inplace=True there. Without it, pandas would leave our DataFrame untouched and hand back a new, sorted DataFrame instead. We just want to edit the one we have.
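If you'd rather keep the original order around, here is a small sketch of the alternative: leave out inplace=True and store the result in a new variable. The name sorted_df is just one picked for this example, and it sorts in descending order to show that too.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Jane', 'Jim', 'Joan'],
                   'age': [32, 28, 41, 35]})

# without inplace=True, sort_values hands back a new, sorted DataFrame
sorted_df = df.sort_values(by='age', ascending=False)

print(sorted_df['name'].tolist())  # ['Jim', 'Joan', 'John', 'Jane']
print(df['name'].tolist())         # ['John', 'Jane', 'Jim', 'Joan'] (unchanged)
```

For the rest of this walkthrough we'll stick with the ascending, inplace=True version, so df itself gets sorted.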

Now, the data looks like this:

   name  age    country
1  Jane   28         UK
0  John   32        USA
3  Joan   35  Australia
2   Jim   41     Canada

You'll notice the indexes are all mixed up though! That's because pandas keeps each row's original index attached to it, even when the rows move around. If we wanted to get back to where we were before, we would do this:

df.sort_index(ascending=True, inplace=True)

And that's it! I love how Python commands usually read just like what they do.
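Sometimes you actually want the new order to stick and the index to count up from 0 again. One way to sketch that is with reset_index, where drop=True tells pandas to throw the old index away rather than keep it as an extra column.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Jane', 'Jim', 'Joan'],
                   'age': [32, 28, 41, 35]})

df.sort_values(by='age', inplace=True)
print(df.index.tolist())    # [1, 0, 3, 2]

# renumber the rows 0, 1, 2, ... in their current (sorted) order
df.reset_index(drop=True, inplace=True)
print(df.index.tolist())    # [0, 1, 2, 3]
print(df['name'].tolist())  # ['Jane', 'John', 'Joan', 'Jim']
```

After this, sort_index can no longer take you back to the original order, so only do it when you're sure you don't need the old numbering.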

CSV files

Let's use it on something a bit more complex so we can see how powerful it is.

When you have spreadsheets of data to work through, whether you created them yourself or exported them from somewhere else, you can use pandas to process that data yourself, so you know how it's being used.

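CSV files are Comma Separated Values files. When opened in a text editor, the first few lines of one look like this:

id,name,class,mark,gender
1,John Deo,Four,75,female
2,Max Ruin,Three,85,male

It's as it says - they are values separated by commas. The first line holds the column names, and every line after it is one row of data. Let's now do something with this.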

I saved this file on my PC as student.csv. Once done, I put it in the same directory as my program, and then used pandas to import that CSV file like this:

import pandas as pd

# read a CSV file and store it as a data frame
df = pd.read_csv("student.csv")

# view the data frame
print(df)

When it's printed on that bottom line, it'll look like this:

    id         name  class  mark  gender
0    1     John Deo   Four    75  female
1    2     Max Ruin  Three    85    male
2    3       Arnold  Three    55    male
3    4   Krish Star   Four    60  female
4    5    John Mike   Four    60  female
5    6    Alex John   Four    55    male
6    7  My John Rob  Fifth    78    male
7    8       Asruid   Five    85    male
8    9      Tes Qry    Six    78    male
9   10     Big John   Four    55  female
10  11       Ronald    Six    89  female
11  12        Recky    Six    94  female
12  13          Kty  Seven    88  female
13  14         Bigy  Seven    88  female
14  15     Tade Row   Four    88    male
15  16        Gimmy   Four    88    male
16  17        Tumyu    Six    54    male
17  18        Honny   Five    75    male
18  19        Tinny   Nine    18    male
19  20       Jackly   Nine    65  female
20  21   Babby John   Four    69  female
21  22       Reggid  Seven    55  female
22  23        Herod  Eight    79    male
23  24    Tiddy Now  Seven    78    male
24  25     Giff Tow  Seven    88    male
25  26       Crelea  Seven    79    male
26  27     Big Nose  Three    81  female
27  28    Rojj Base  Seven    86  female
28  29  Tess Played  Seven    55    male
29  30    Reppy Red    Six    79  female
30  31  Marry Toeey   Four    88    male
31  32    Binn Rott  Seven    90  female
32  33    Kenn Rein    Six    96  female
33  34     Gain Toe  Seven    69    male
34  35   Rows Noump    Six    88  female

As before, we can use the power of the Panda to sort this data for us - I want to sort this by gender, so I can see all the males together and all the females together.

df.sort_values(by="gender", inplace=True)

And the result:

    id         name  class  mark  gender
0    1     John Deo   Four    75  female
32  33    Kenn Rein    Six    96  female
31  32    Binn Rott  Seven    90  female
29  30    Reppy Red    Six    79  female
27  28    Rojj Base  Seven    86  female
26  27     Big Nose  Three    81  female
21  22       Reggid  Seven    55  female
20  21   Babby John   Four    69  female
19  20       Jackly   Nine    65  female
13  14         Bigy  Seven    88  female
12  13          Kty  Seven    88  female
34  35   Rows Noump    Six    88  female
10  11       Ronald    Six    89  female
9   10     Big John   Four    55  female
3    4   Krish Star   Four    60  female
4    5    John Mike   Four    60  female
11  12        Recky    Six    94  female
1    2     Max Ruin  Three    85    male
2    3       Arnold  Three    55    male
30  31  Marry Toeey   Four    88    male
28  29  Tess Played  Seven    55    male
5    6    Alex John   Four    55    male
25  26       Crelea  Seven    79    male
24  25     Giff Tow  Seven    88    male
23  24    Tiddy Now  Seven    78    male
6    7  My John Rob  Fifth    78    male
7    8       Asruid   Five    85    male
8    9      Tes Qry    Six    78    male
18  19        Tinny   Nine    18    male
33  34     Gain Toe  Seven    69    male
16  17        Tumyu    Six    54    male
15  16        Gimmy   Four    88    male
14  15     Tade Row   Four    88    male
22  23        Herod  Eight    79    male
17  18        Honny   Five    75    male

The females come first because 'female' comes before 'male' alphabetically. Likewise, we can sort by mark too, in descending order, to see who got the highest and lowest:

df.sort_values(by="mark", ascending=False, inplace=True)

Result:

    id         name  class  mark  gender
32  33    Kenn Rein    Six    96  female
11  12        Recky    Six    94  female
31  32    Binn Rott  Seven    90  female
10  11       Ronald    Six    89  female
12  13          Kty  Seven    88  female
13  14         Bigy  Seven    88  female
30  31  Marry Toeey   Four    88    male
34  35   Rows Noump    Six    88  female
24  25     Giff Tow  Seven    88    male
15  16        Gimmy   Four    88    male
14  15     Tade Row   Four    88    male
27  28    Rojj Base  Seven    86  female
7    8       Asruid   Five    85    male
1    2     Max Ruin  Three    85    male
26  27     Big Nose  Three    81  female
25  26       Crelea  Seven    79    male
22  23        Herod  Eight    79    male
29  30    Reppy Red    Six    79  female
8    9      Tes Qry    Six    78    male
6    7  My John Rob  Fifth    78    male
23  24    Tiddy Now  Seven    78    male
0    1     John Deo   Four    75  female
17  18        Honny   Five    75    male
33  34     Gain Toe  Seven    69    male
20  21   Babby John   Four    69  female
19  20       Jackly   Nine    65  female
4    5    John Mike   Four    60  female
3    4   Krish Star   Four    60  female
5    6    Alex John   Four    55    male
28  29  Tess Played  Seven    55    male
2    3       Arnold  Three    55    male
9   10     Big John   Four    55  female
21  22       Reggid  Seven    55  female
16  17        Tumyu    Six    54    male
18  19        Tinny   Nine    18    male
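The two sorts can also be combined. As a rough sketch, using just five of the students from the table so it runs on its own, sort_values accepts a list of columns, and nlargest is a shortcut when you only care about the top few marks. The names by_both and top_two are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John Deo', 'Max Ruin', 'Arnold', 'Krish Star', 'Kenn Rein'],
                   'mark': [75, 85, 55, 60, 96],
                   'gender': ['female', 'male', 'male', 'female', 'female']})

# sort by gender first, then by mark (highest first) inside each gender
by_both = df.sort_values(by=['gender', 'mark'], ascending=[True, False])
print(by_both['name'].tolist())
# ['Kenn Rein', 'John Deo', 'Krish Star', 'Max Ruin', 'Arnold']

# just the two highest marks, without sorting the whole thing ourselves
top_two = df.nlargest(2, 'mark')
print(top_two['name'].tolist())  # ['Kenn Rein', 'Max Ruin']
```

Neither call uses inplace=True here, so df keeps its original order and the results live in their own variables.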

Okay, so that's all well and good, but it's not pretty is it?

Well, no. It's still a bit boring. Let's make some nice graphs of it all so the panda is happy!

I downloaded a CSV file from a datasets website (comment if you'd like to know from where!) with the Global Temperatures of the planet for the last roughly 270 years.

I then imported it, like before, as such:

import pandas as pd

df = pd.read_csv('temps.csv')

I won't print out the entire file - it's roughly 3200 rows. All I'm interested in is graphing this data to see how it looks!

First though, we need to add a new library, as pandas doesn't draw graphs on its own (its plotting features rely on matplotlib). Type the following in your command prompt/terminal to install a library called matplotlib:

pip install matplotlib

Now, we can use matplotlib like we used pandas:

import matplotlib.pyplot as plt

As before, rather than typing out that complex matplotlib.pyplot every time, we use plt for our ease and, again, because that's the general way it's done online.

I then added this code to the file:

plt.plot(df['dt'], df['LandAverageTemperature'])

plt.title('Average Land Temperature')

plt.xlabel('Year')
plt.ylabel('Temperature (Celsius)')

plt.show()

So, in this bit of code, we:

  • Tell matplotlib the 2 columns to plot - 'dt' is the date of each reading, and 'LandAverageTemperature' is the average land temperature on that date.

  • Add a title

  • Add an x (horizontal) label of 'Year'

  • Add a y (vertical) label of 'Temperature (Celsius)'

  • Show the plot

Everything should now have worked and you'll see a pretty crazy jagged line on your screen.

It looks so messy because we have roughly 3200 data points here - smaller DataFrames wouldn't have so many points, but this is just an example.
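One way to calm the line down is to average the readings for each year before plotting. The sketch below uses a tiny made-up dataset rather than the real file, and it assumes the 'dt' column holds dates written like 2000-01-01. The two column names are the ones used above; the values, the name yearly and the output file name are all invented for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# made-up monthly readings for three years, standing in for temps.csv
dates = pd.date_range('2000-01-01', periods=36, freq='MS').strftime('%Y-%m-%d')
temps = [10 + (i % 12) + i // 12 for i in range(36)]
df = pd.DataFrame({'dt': dates, 'LandAverageTemperature': temps})

# turn the date text into real dates so pandas can pull the year out
df['dt'] = pd.to_datetime(df['dt'])

# one average temperature per year
yearly = df.groupby(df['dt'].dt.year)['LandAverageTemperature'].mean()
print(yearly.tolist())  # [15.5, 16.5, 17.5]

plt.plot(yearly.index, yearly.values)
plt.title('Average Land Temperature per Year')
plt.xlabel('Year')
plt.ylabel('Temperature (Celsius)')
plt.savefig('yearly_temps.png')  # or plt.show() to open a window instead
```

With the real data you would swap the made-up DataFrame for pd.read_csv('temps.csv') and keep the rest. Whether the date format matches is worth checking by printing the first few rows first.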

Phew - that was a lot! Now to summarise:

  • Pandas is a useful library for arranging our data, whether we typed it in by hand or loaded it from a CSV file, into a nice-looking box of data to play around with

  • We know how to sort the data by certain columns

  • We imported matplotlib to actually plot our data, which makes it easier to visualise.

  • We plotted the data and could label the created plot so we can understand the visualisation.

I hope you all had fun learning this! It was a lot so don't worry if you need to read through it over and over.

Please comment if you need any further advice or anything, I love seeing your comments and thank you to all the subscribers! I hope to bring you constant content to keep you all learning. Remember - the more we learn, the more capable we are to learn.