Data Structures in Python
Strings, integers, and Booleans are among the most basic Python object types in regular use.
Collection
A collection groups multiple values together in Python. Collections are container data types, and they differ in how they store values and in how those values can be indexed or modified.
By default, Python comes with several collection object types:
List
Dictionary
Set
Tuple
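As a rough sketch of how these four types differ, the snippet below builds one of each from made-up values. The variable names and numbers are illustrative assumptions, not part of the examples that follow.

```python
# Illustrative values only; every name below is hypothetical.
scores_list = [7, 8, 9, 8]                # ordered, mutable, keeps duplicates
scores_tuple = (7, 8, 9, 8)               # ordered, immutable
scores_set = {7, 8, 9, 8}                 # unordered, duplicates are dropped
scores_dict = {"first": 7, "second": 8}   # maps keys to values

scores_list[0] = 10             # lists can be modified in place
print(scores_list)              # [10, 8, 9, 8]
print(len(scores_set))          # 3, because the repeated 8 is stored once
print(scores_dict["second"])    # 8, looked up by key rather than by position

try:
    scores_tuple[0] = 10        # tuples cannot be modified
except TypeError as err:
    print("tuple error:", err)
```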
One of the most commonly used collection types is the list. We can put values into a list by separating each entry with commas and placing the result inside square brackets. Consider the following example:
In [1]: my_list = [7, 8, 9, 1]
my_list
Out [1]: [7, 8, 9, 1]
Every element of this object is an integer, but the object itself is a list.
In [2]: type(my_list)
Out [2]: list
A list can hold values of different types, including other lists.
In [3]: nested_list = [1, 2, 3, ['Boo!', True]]
type(nested_list)
Out [3]: list
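To see what the nested list actually holds, a short sketch like the following inspects the type of each element and then indexes into the inner list. The helper names element_types and inner are introduced here only for illustration.

```python
nested_list = [1, 2, 3, ['Boo!', True]]

# Each element keeps its own type inside the list.
element_types = [type(item).__name__ for item in nested_list]
print(element_types)        # ['int', 'int', 'int', 'list']

# The last element is itself a list, so it can be indexed again.
inner = nested_list[3]
print(inner[0])             # Boo!
print(nested_list[3][1])    # True
```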
NumPy Arrays
As the name suggests, NumPy is a module for numerical computing, and it has been instrumental to Python's popularity as an analytics tool. A NumPy array is a collection of data in which all items are of the same type, and it can store data in any number of dimensions (hence 'n' dimensions).
Let's understand more about this data type by focusing on a one-dimensional array, converting our first one from a list using the array() function:
In [4]: import numpy
my_array = numpy.array([6, 2, 5, 1])
my_array
Out [4]: array([6, 2, 5, 1])
A numpy array looks a lot like a list but it is a different data type. It's an ndarray, or n-dimensional array. Since it is a different data type, it behaves differently with operations.
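The difference in type can be checked directly. The sketch below assumes NumPy is imported under the common alias np, and the two-dimensional array named table is a made-up example added to show what "n-dimensional" means in practice.

```python
import numpy as np

my_array = np.array([6, 2, 5, 1])
print(type(my_array).__name__)   # ndarray
print(my_array.ndim)             # 1
print(my_array.shape)            # (4,)

# A hypothetical two-dimensional array: two rows, two columns.
table = np.array([[6, 2], [5, 1]])
print(table.ndim)                # 2
print(table.shape)               # (2, 2)
```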
Let's see what happens when we multiply a list and a numpy array by 2.
List | NumPy Array |
In [5]: my_list * 2 | In [6]: my_array * 2 |
Out [5]: [7, 8, 9, 1, 7, 8, 9, 1] | Out [6]: array([12,  4, 10,  2]) |
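Put differently, multiplying a list repeats it, while multiplying an array works element by element. As a minimal sketch (assuming the np alias for NumPy), getting the element-wise result from a plain list takes a comprehension:

```python
import numpy as np

my_list = [7, 8, 9, 1]
my_array = np.array([6, 2, 5, 1])

repeated = my_list * 2                       # repetition: the list is concatenated with itself
doubled_list = [x * 2 for x in my_list]      # element-wise doubling of a list needs a loop
doubled_array = my_array * 2                 # element-wise doubling is built into arrays

print(repeated)                  # [7, 8, 9, 1, 7, 8, 9, 1]
print(doubled_list)              # [14, 16, 18, 2]
print(doubled_array.tolist())    # [12, 4, 10, 2]
```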
Element-wise arithmetic of this kind is similar to how an Excel range or an R vector behaves. Like R vectors, numpy arrays coerce data to be of the same type:
In [7]: my_coerced_array = numpy.array([1, 2, 3, 'Boo!'])
my_coerced_array
Out[7]: array(['1','2','3','Boo!'], dtype='<U11')
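Coercion is not limited to strings. The sketch below, which assumes the np alias and uses made-up variable names, shows integers being promoted to floats, and shows that numbers coerced to strings have to be converted back before any arithmetic. The exact string width in the dtype can vary by platform, so only the dtype kind is printed.

```python
import numpy as np

# Mixing integers and a float promotes everything to float.
mixed_numbers = np.array([1, 2, 3.5])
print(mixed_numbers.dtype)       # float64
print(mixed_numbers.tolist())    # [1.0, 2.0, 3.5]

# Mixing numbers and a string turns everything into strings.
mixed_text = np.array([1, 2, 3, 'Boo!'])
print(mixed_text.dtype.kind)     # U, meaning a Unicode string type
print(mixed_text.tolist())       # ['1', '2', '3', 'Boo!']

# The numeric entries must be converted back before doing arithmetic.
as_ints = mixed_text[:3].astype(int)
print((as_ints * 2).tolist())    # [2, 4, 6]
```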
Indexing and Subsetting NumPy Arrays
Let's learn indexing: how to pull individual items from a numpy array. We can extract an item by placing its index number in square brackets directly after the object name.
In[8]: # Get fourth element...
my_array[3]
Out[8]: 1
Computers often start counting at zero. This is called zero-based indexing, and Python uses it, which is why the fourth element has index 3.
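One way to make the index numbers visible is to print each position next to its value. This is only a sketch, with np assumed as the NumPy alias and positions and caught as illustrative names. It also shows that asking for index 4 in a four-element array raises an error, since the valid indexes are 0 through 3.

```python
import numpy as np

my_array = np.array([6, 2, 5, 1])

# Pair each index with its value: indexes run from 0 to 3.
positions = list(enumerate(my_array.tolist()))
print(positions)          # [(0, 6), (1, 2), (2, 5), (3, 1)]

print(my_array[0])        # 6, the first element
print(my_array[3])        # 1, the fourth and last element

# There is no index 4 in an array of four elements.
caught = False
try:
    my_array[4]
except IndexError as err:
    caught = True
    print("IndexError:", err)
```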
Now, let's move on to subsetting a selection of consecutive values, called "slicing" in Python. Let's find the second through fourth elements.
In [9]: # Get second through fourth elements.
my_array[1:3]
Out[9]: array([2,5])
As you can see, the slice excludes the element at the ending index, so only two elements come back. To get all three elements, we need to add 1 to the second number:
In[10]: # Get second through fourth elements
my_array[1:4]
Out[10]: array([2,5,1])
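A handy consequence of the exclusive end is that a slice from start to stop always contains stop minus start elements. The following sketch (np alias assumed, variable names illustrative) checks that rule and shows that either bound can be left out:

```python
import numpy as np

my_array = np.array([6, 2, 5, 1])

start, stop = 1, 4
selection = my_array[start:stop]
print(selection.tolist())        # [2, 5, 1]
print(len(selection))            # 3, which equals stop - start

# Omitting a bound means "from the beginning" or "to the end".
print(my_array[:2].tolist())     # [6, 2]
print(my_array[1:].tolist())     # [2, 5, 1]
```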
Two-dimensional numpy arrays can serve as a tabular Python data structure, but all elements must be of the same data type. When we analyze data, however, the elements are rarely all of the same type. To overcome this drawback, we use pandas.
Pandas DataFrames
Pandas, named after the "panel data" of econometrics, is used for manipulating and analyzing tabular data. It comes installed with Anaconda, and it is typically imported under the alias "pd".
In[11]: import pandas as pd
Pandas includes a one-dimensional data structure called a "Series" and a two-dimensional structure called a "DataFrame". The DataFrame is the more widely used of the two. A DataFrame can be created from other data types, including numpy arrays, using the DataFrame constructor.
In[12]: import numpy as np
record_1 = np.array(['An', 55, False])
record_2 = np.array(['Amy', 65, True])
record_3 = np.array(['Lee', 45, False])
record_4 = np.array(['John', 68, False])
record_5 = np.array(['Ben', 52, True])
roster = pd.DataFrame(data = [record_1, record_2, record_3, record_4, record_5], columns = ['name', 'height', 'injury'])
roster
Out[12]:
   name height injury
0    An     55  False
1   Amy     65   True
2   Lee     45  False
3  John     68  False
4   Ben     52   True
DataFrames generally include named labels for each column and an index running down the rows.
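One caveat with the roster above: each record is a numpy array, so the heights and injury flags are coerced to strings before pandas ever sees them. As an alternative sketch, not the construction used in this article, the same values can be passed as a dictionary of columns so that each column keeps its own type:

```python
import pandas as pd

# Same values as the roster, supplied column by column (an assumed alternative layout).
roster = pd.DataFrame({
    "name": ["An", "Amy", "Lee", "John", "Ben"],
    "height": [55, 65, 45, 68, 52],
    "injury": [False, True, False, False, True],
})

print(pd.api.types.is_integer_dtype(roster["height"]))   # True
print(pd.api.types.is_bool_dtype(roster["injury"]))      # True

# Numeric columns now support arithmetic directly.
print(roster["height"].mean())    # 57.0
print(roster["injury"].sum())     # 2 injured players
```

The later indexing examples behave the same way with either construction, since iloc and loc depend only on positions and labels.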
Indexing and Subsetting DataFrames
To index a DataFrame, we can use the iloc (integer location) indexer. Here, we need to index by both row and column.
In[13]: #First row, first column of DataFrame
roster.iloc[0,0]
Out[13]: 'An'
To slice a DataFrame and capture multiple rows and columns:
In[14]: #Second through fourth rows, first through third columns of DataFrame
roster.iloc[1:4,0:3]
Out[14]:
   name height injury
1   Amy     65   True
2   Lee     45  False
3  John     68  False
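Like numpy slices, iloc slices stop just before the ending position. The sketch below rebuilds the roster from a dictionary of columns, which is an assumption made to keep the example self-contained, and tries a few more position-based selections:

```python
import pandas as pd

roster = pd.DataFrame({
    "name": ["An", "Amy", "Lee", "John", "Ben"],
    "height": [55, 65, 45, 68, 52],
    "injury": [False, True, False, False, True],
})

# Rows at positions 1, 2, 3 and columns at positions 0, 1, 2.
middle_rows = roster.iloc[1:4, 0:3]
print(middle_rows.shape)               # (3, 3)
print(middle_rows["name"].tolist())    # ['Amy', 'Lee', 'John']

# Negative positions count from the end.
last_name = roster.iloc[-1, 0]
print(last_name)                       # Ben

# A single row position returns that row as a Series.
first_row = roster.iloc[0]
print(first_row["height"])             # 55
```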
To index an entire column by name, we can use the related loc indexer.
In[15]: #Select all rows in the name column
roster.loc[:,'name']
Out[15]:
0      An
1     Amy
2     Lee
3    John
4     Ben
Name: name, dtype: object
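A single column comes back as a Series, the one-dimensional pandas structure. Two further loc behaviors are worth a sketch: label slices include the ending label, unlike iloc, and rows can be filtered with a condition. The roster is again rebuilt from a dictionary of columns as an assumption, so that height is numeric and can be compared.

```python
import pandas as pd

roster = pd.DataFrame({
    "name": ["An", "Amy", "Lee", "John", "Ben"],
    "height": [55, 65, 45, 68, 52],
    "injury": [False, True, False, False, True],
})

names = roster.loc[:, "name"]
print(type(names).__name__)            # Series

# Label slices are inclusive: rows labeled 1, 2, and 3 are all returned.
label_slice = roster.loc[1:3, ["name", "height"]]
print(label_slice["name"].tolist())    # ['Amy', 'Lee', 'John']

# A Boolean condition selects only the matching rows.
tall_names = roster.loc[roster["height"] > 60, "name"]
print(tall_names.tolist())             # ['Amy', 'John']
```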
Writing DataFrames
Pandas also includes methods to write DataFrames to both .csv files and .xlsx workbooks: to_csv() and to_excel().
In[16]: roster.to_csv('output/roster-output-python.csv')
roster.to_excel('output/roster-output-python.xlsx')
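To confirm that a file was written as expected, it can be read back in. The sketch below writes to a temporary folder instead of the output folder used above, and passes index=False so that the row index is not saved as an extra column. Both choices are assumptions made for this example. It covers only the .csv case, because to_excel() additionally needs an Excel engine such as openpyxl to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

roster = pd.DataFrame({
    "name": ["An", "Amy", "Lee", "John", "Ben"],
    "height": [55, 65, 45, 68, 52],
    "injury": [False, True, False, False, True],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "roster-output-python.csv"
    roster.to_csv(csv_path, index=False)   # leave the row index out of the file
    reloaded = pd.read_csv(csv_path)       # read the file back into a new DataFrame

print(list(reloaded.columns))        # ['name', 'height', 'injury']
print(reloaded.shape)                # (5, 3)
print(reloaded["height"].tolist())   # [55, 65, 45, 68, 52]
```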