Anitha

Data Structures in Python

Strings, integers, and Booleans are among the most basic and commonly used Python object types.


Collection

A collection is a grouping of multiple values in Python. Collections are container data types. They vary in how they store values and in how they can be indexed or modified.

By default, Python comes with several collection object types:

  • List

  • Dictionary

  • Set

  • Tuple
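As a rough illustration of how these four types differ in storage, indexing, and modification, here is a minimal sketch. The variable names and sample values are hypothetical and chosen only for this example.

```python
# Hypothetical sample values, used only to contrast the four built-in collections.
sample_list = [7, 8, 9, 1]         # ordered and mutable
sample_tuple = (7, 8, 9, 1)        # ordered and immutable
sample_set = {7, 8, 9, 1, 7}       # unordered; the duplicate 7 is dropped
sample_dict = {'a': 7, 'b': 8}     # key-value pairs

sample_list[0] = 70                # a list can be modified in place
first_of_tuple = sample_tuple[0]   # a tuple is indexed by position
value_for_b = sample_dict['b']     # a dictionary is indexed by key

print(sample_list)                 # [70, 8, 9, 1]
print(first_of_tuple)              # 7
print(len(sample_set))             # 4
print(value_for_b)                 # 8
```

Trying `sample_tuple[0] = 70` would raise a TypeError, because tuples cannot be changed after creation.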

One of the most commonly and extensively used collection types is the list. We can put values into a list by separating each entry with commas and placing the result inside square brackets. Let's consider the following example:

In [1]: my_list = [7, 8, 9, 1]
        my_list
Out [1]: [7, 8, 9, 1]

Every element in this object is an integer, but the object itself is a list.

In [2]: type(my_list)
Out [2]: list

We can include all different sorts of data inside a list.

In [3]: nested_list = [1, 2, 3, ['Boo!', True]]
        type(nested_list)
Out [3]: list

NumPy Arrays

As the name suggests, NumPy is a module for numerical computing and has been instrumental to Python's popularity as an analytics tool. A NumPy array is a collection of data in which all items are of the same type, and it can store data in any number of dimensions (hence "n-dimensional").

Let's understand more about this data type by focusing on a one-dimensional array, creating our first one from a list with the array() function:

In [4]: import numpy
        my_array = numpy.array([6, 2, 5, 1])
        my_array
Out [4]: array([6, 2, 5, 1])

A numpy array looks a lot like a list but it is a different data type. It's an ndarray, or n-dimensional array. Since it is a different data type, it behaves differently with operations.

Let's see what happens when we multiply a list and a NumPy array by 2.

List

In [5]: my_list * 2
Out [5]: [7, 8, 9, 1, 7, 8, 9, 1]

NumPy Array

In [6]: my_array * 2
Out [6]: array([12, 4, 10, 2])
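The contrast can be sketched in one self-contained snippet. The names `repeated`, `doubled_list`, and `doubled_array` are hypothetical, and the list comprehension is just one way to double a plain list element by element.

```python
import numpy

my_list = [7, 8, 9, 1]
my_array = numpy.array([6, 2, 5, 1])

repeated = my_list * 2                     # list: the contents are repeated
doubled_list = [x * 2 for x in my_list]    # list: element-wise needs a loop or comprehension
doubled_array = my_array * 2               # array: multiplication is element-wise

print(repeated)                # [7, 8, 9, 1, 7, 8, 9, 1]
print(doubled_list)            # [14, 16, 18, 2]
print(doubled_array.tolist())  # [12, 4, 10, 2]
```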

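Multiplying the list repeats its contents, while multiplying the array doubles each element. This element-wise behavior is similar to an Excel range or an R vector. Like R vectors, numpy arrays coerce data to be of the same type: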

In [7]: my_coerced_array = numpy.array([1, 2, 3, 'Boo!'])
        my_coerced_array
Out [7]: array(['1', '2', '3', 'Boo!'], dtype='<U11')
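One way to confirm the coercion is to inspect each array's dtype. This sketch checks only the dtype "kind" (integer versus Unicode string), since the exact string width reported can differ between platforms. The names `int_array` and `mixed_array` are hypothetical.

```python
import numpy

int_array = numpy.array([1, 2, 3])
mixed_array = numpy.array([1, 2, 3, 'Boo!'])

print(int_array.dtype.kind)    # i  (signed integer)
print(mixed_array.dtype.kind)  # U  (Unicode string)
print(mixed_array.tolist())    # ['1', '2', '3', 'Boo!']
```

Because the numbers are now stored as text, arithmetic on `mixed_array` no longer behaves as it does on an integer array.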

Indexing and Subsetting NumPy Arrays

Let's learn "indexing": how to pull individual items from a numpy array. We can extract an item by placing its index number in square brackets directly after the object name.

In[8]: # Get fourth element...
         my_array[3]
Out[8]: 1

Index 3 returns the fourth element because Python, like many programming languages, starts counting at zero. This is called zero-based indexing.

Now, let's move on to subsetting a selection of consecutive values, called "slicing" in Python. Let's find the second through fourth elements.

In [9]: # Get second through fourth elements.
        my_array[1:3]
Out[9]: array([2,5])

As you can see, slicing excludes the ending index. To get all three elements, we need to add 1 to the second number:

In[10]: # Get second through fourth elements
        my_array[1:4]
Out[10]: array([2,5,1])
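A quick way to reason about a slice is that `start:stop` returns `stop - start` items. The sketch below uses hypothetical `start` and `stop` variables to show this, and also shows that leaving out the stop value runs the slice to the end of the array.

```python
import numpy

my_array = numpy.array([6, 2, 5, 1])

start, stop = 1, 4
subset = my_array[start:stop]          # positions 1, 2, and 3

print(subset.tolist())                 # [2, 5, 1]
print(len(subset) == stop - start)     # True
print(my_array[1:].tolist())           # [2, 5, 1]  (no stop: slice runs to the end)
```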

Two-dimensional numpy arrays can serve as a tabular Python data structure, but all elements must be of the same data type. When we analyze data, though, the elements are rarely all of the same type. To overcome this drawback, we use "pandas".
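To make the limitation concrete, here is a small sketch of a two-dimensional array. The names `table` and `mixed_table` and their values are hypothetical. A nested list becomes rows and columns, and mixing text with numbers turns every element into a string.

```python
import numpy

# A 2 x 3 table of integers: two rows, three columns.
table = numpy.array([[1, 2, 3],
                     [4, 5, 6]])
print(table.shape)            # (2, 3)
print(table[0, 1])            # 2  (first row, second column)
print(table[:, 0].tolist())   # [1, 4]  (all rows, first column)

# Mixing names and numbers coerces everything to strings.
mixed_table = numpy.array([['An', 55],
                           ['Amy', 65]])
print(mixed_table.dtype.kind)  # U
print(mixed_table[0, 1])       # 55  (stored as the text '55', not the number)
```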


Pandas DataFrames

Pandas, named after the "panel data" of econometrics, is used for manipulating and analyzing tabular data. It comes installed with Anaconda, and its conventional alias is "pd".

In[11]: import pandas as pd

Pandas includes a one-dimensional data structure called a "Series" and a two-dimensional structure called a "DataFrame". The DataFrame is the most widely used structure. A DataFrame can be created from other data types, including numpy arrays, using the DataFrame() constructor.

In[12]: record_1 = numpy.array(['An', 55, False])
        record_2 = numpy.array(['Amy', 65, True])
        record_3 = numpy.array(['Lee', 45, False])
        record_4 = numpy.array(['John', 68, False])
        record_5 = numpy.array(['Ben', 52, True])
        roster = pd.DataFrame(data = [record_1, record_2, record_3, record_4, record_5], columns = ['name', 'height', 'injury'])
        roster
Out[12]:
               name               height               injury
         0     An                   55                  False
         1     Amy                  65                  True
         2     Lee                  45                  False
         3     John                 68                  False
         4     Ben                  52                  True

DataFrames generally include named labels for each column and an index running down the rows.
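One caveat: because each record above is a numpy array, its values are coerced to strings before they reach pandas, so height and injury are stored as text. A possible alternative, sketched here as an assumption rather than the author's method, is to build the DataFrame from a dictionary of columns so that each column keeps its own type. The name `roster_typed` is hypothetical.

```python
import pandas as pd

# Assumption: the same five records listed above, laid out by column instead of by row.
roster_typed = pd.DataFrame({
    'name': ['An', 'Amy', 'Lee', 'John', 'Ben'],
    'height': [55, 65, 45, 68, 52],
    'injury': [False, True, False, False, True],
})

print(roster_typed.dtypes)              # one dtype per column
print(roster_typed['height'].mean())    # 57.0
print(roster_typed['injury'].sum())     # 2  (number of True values)
```

With numeric and Boolean columns, calculations such as the mean height work directly.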


Indexing and Subsetting DataFrames

To index a DataFrame by position, we can use the iloc (integer location) indexer. Here, we need to index by both row and column.

In[13]: #First row, first column of DataFrame
       roster.iloc[0,0]
Out[13]: 'An'

To slice a DataFrame and capture multiple rows and columns:

In[14]: #Second through fourth rows, first through third column of DataFrame
       roster.iloc[1:4,0:3]
Out[14]:
               name               height               injury
         1     Amy                  65                  True
         2     Lee                  45                  False
         3     John                 68                  False
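The same positional rules can be checked in a self-contained sketch. For brevity it rebuilds the roster from plain Python lists, which is an assumption made for this example and not the construction used above.

```python
import pandas as pd

# Assumption: roster rebuilt from plain lists so the snippet runs on its own.
roster = pd.DataFrame(
    data=[['An', 55, False],
          ['Amy', 65, True],
          ['Lee', 45, False],
          ['John', 68, False],
          ['Ben', 52, True]],
    columns=['name', 'height', 'injury'],
)

first_cell = roster.iloc[0, 0]      # first row, first column
subset = roster.iloc[1:4, 0:3]      # rows at positions 1-3, columns at positions 0-2

print(first_cell)                   # An
print(subset.shape)                 # (3, 3)
print(subset['name'].tolist())      # ['Amy', 'Lee', 'John']
```

As with numpy slicing, the ending position in each iloc slice is excluded.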

To index an entire column by name, we can use the related loc indexer.

In[15]: #Select all rows in the name column
        roster.loc[:,'name']
Out[15]: 
        0        An
        1        Amy
        2        Lee
        3        John
        4        Ben
        Name: name, dtype: object
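One difference worth knowing: loc selects by label, and a label slice includes its ending label, while an iloc slice excludes its ending position. The sketch below rebuilds the roster from plain lists (an assumption made so the snippet runs on its own) and compares the two.

```python
import pandas as pd

# Assumption: roster rebuilt from plain lists so the snippet runs on its own.
roster = pd.DataFrame(
    data=[['An', 55, False],
          ['Amy', 65, True],
          ['Lee', 45, False],
          ['John', 68, False],
          ['Ben', 52, True]],
    columns=['name', 'height', 'injury'],
)

all_names = roster.loc[:, 'name']        # every row of the name column
by_label = roster.loc[1:3, 'name']       # row labels 1, 2, and 3 (end label included)
by_position = roster.iloc[1:3, 0]        # positions 1 and 2 (end position excluded)

print(all_names.tolist())     # ['An', 'Amy', 'Lee', 'John', 'Ben']
print(by_label.tolist())      # ['Amy', 'Lee', 'John']
print(by_position.tolist())   # ['Amy', 'Lee']
```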


Writing DataFrames

Pandas also includes methods to write DataFrames to both .csv files and .xlsx workbooks: to_csv() and to_excel().

In[16]: roster.to_csv('output/roster-output-python.csv')
        roster.to_excel('output/roster-output-python.xlsx')
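Note that the output folder must already exist before these calls run. As a minimal sketch of a round trip, the example below writes to a temporary folder instead (an assumption made so the snippet runs anywhere), passes index=False to leave the row index out of the file, and reads the file back with read_csv(). The roster is rebuilt from plain lists for the same reason.

```python
import os
import tempfile

import pandas as pd

# Assumption: roster rebuilt from plain lists so the snippet runs on its own.
roster = pd.DataFrame(
    data=[['An', 55, False],
          ['Amy', 65, True],
          ['Lee', 45, False],
          ['John', 68, False],
          ['Ben', 52, True]],
    columns=['name', 'height', 'injury'],
)

# Write to a temporary folder, then read the file back before the folder is removed.
with tempfile.TemporaryDirectory() as output_dir:
    csv_path = os.path.join(output_dir, 'roster-output-python.csv')
    roster.to_csv(csv_path, index=False)     # index=False omits the row index column
    roster_from_csv = pd.read_csv(csv_path)

print(list(roster_from_csv.columns))         # ['name', 'height', 'injury']
print(roster_from_csv['height'].tolist())    # [55, 65, 45, 68, 52]
```

Writing .xlsx files with to_excel() additionally needs an Excel writer package such as openpyxl to be installed.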
