# Python Modules, Strings, and Data Science Packages

Learning Objectives:
* Students will learn various means of importing Python modules and functions.
* Students will learn intermediate string operations and be introduced to regular expression pattern matching.
* Students will survey the Python packages most commonly used by data scientists.

Readings before class:

* Jake VanderPlas.  [A Whirlwind Tour of Python](https://github.com/jakevdp/WhirlwindTourOfPython) sections:
  * [13 - Modules and Packages](https://github.com/jakevdp/WhirlwindTourOfPython/blob/master/13-Modules-and-Packages.ipynb)
  * [14 - String Manipulation and Regular Expressions](https://github.com/jakevdp/WhirlwindTourOfPython/blob/master/14-Strings-and-Regular-Expressions.ipynb) _We could spend multiple classes on regular expressions alone, but deep coverage is beyond the scope of this course.  However, regular expressions are very useful, so getting practice with Python regular expressions would be a good personal study goal beyond this course.  Reference resources for regular expressions are listed below._
  * [15 - A Preview of Data Science Tools](https://github.com/jakevdp/WhirlwindTourOfPython/blob/master/15-Preview-of-Data-Science-Tools.ipynb) _Skim this chapter.  We will cover these in greater depth in the weeks to come._
* Allen B. Downey.  [Think Python 2e](https://greenteapress.com/wp/think-python-2e/):
  * Review [Chapter 8  Strings](http://greenteapress.com/thinkpython2/html/thinkpython2009.html)

Optional reference:
* W3Schools' [RegEx (Regular Expression) tutorial](https://www.w3schools.com/python/python_regex.asp)
* Al Sweigart's [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/)
  * [Chapter 6 - Manipulating Strings](https://automatetheboringstuff.com/2e/chapter6/)
  * [Chapter 7 - Pattern Matching with Regular Expressions](https://automatetheboringstuff.com/2e/chapter7/)
* A.M. Kuchling's [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html)

Activities before class:
* Read below up to (but not including) the section marked Homework.  **Be sure to do the pre-class exercises as you do your reading on strings and regular expressions.** You are encouraged to add code blocks and play with the forms to gain understanding and comfort with them.

In class:
* We will work together in class on the section labeled "In Class".

Homework after class:
* Complete the section labeled "Homework" below before the next class, when it will be collected.

## Python Modules and Packages

A Python _module_ is a file that contains Python code, such as function definitions, that one may want to import and use.  A Python _package_ is a collection of Python _modules_ together with an ```__init__.py``` file that determines which definitions in those modules are visible to those who import the package.  There are a few forms of imports that you will want to become familiar with:

In [1]:
import math  # allows you to use definitions of module math by preceding their names with "math."
print(math.pi)

from math import sqrt, isclose  # allows you to use sqrt and isclose of module math without needing to precede them with "math."
x = sqrt(2) ** 2
print(x, x == 2, isclose(x, 2))

import math as m  # allows you to use definitions of module math by preceding their names with "m."
print(m.sin(m.pi))

# from math import *  <-- WARNING: This allows you to use all public definitions of module math without a prefix.
#   This is generally discouraged, because it can have unintended consequences if module math has a
#   function you didn't consider or know about that overrides a function you had of the same name.

3.141592653589793
2.0000000000000004 False True
1.2246467991473532e-16
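To make the warning about ```from math import *``` concrete, here is a minimal sketch that lists which names in module math share a name with a Python built-in and would therefore replace it after a star import. The variable names ```math_names``` and ```shadowed``` are illustrative choices, not part of the reading.

```python
import builtins
import math

# Public names a star import of math would bring in
# (math defines no __all__, so these are the names without a leading underscore).
math_names = {name for name in dir(math) if not name.startswith('_')}

# Names that would silently replace a built-in of the same name.
shadowed = sorted(math_names & set(dir(builtins)))
print(shadowed)

# The two versions of pow behave differently:
print(pow(2, 3))       # 8   (built-in pow keeps integers as integers)
print(pow(2, 3, 5))    # 3   (built-in pow accepts an optional modulus)
print(math.pow(2, 3))  # 8.0 (math.pow always returns a float and takes exactly two arguments)
```

After a star import of math, a call such as ```pow(2, 3, 5)``` would reach ```math.pow``` and raise a TypeError, which is the kind of unintended consequence the warning describes.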


## String Manipulation

Most string operations are straightforward and well illustrated in the assigned VanderPlas reading. Here, in class, and in the homework, you will exercise these operations.

**To-do: Complete the instructions of each comment below.**

In [1]:
# Create and print a multiline string.



# Print the following string as it would appear as a title, with each word capitalized.
# (Python ignores the convention of not capitalizing words "or" and "and".)
s = "the hobbit, or there and back again"



# Print the following string with the '=' characters stripped from beginning and end.
s = '===Pesky Equal Signs==='



# Print the index where the string 'find' is first found in the following string.
s = 'After you have finished your experiment, be sure to share all of your findings.'



# Print the following string with every occurrence of lowercase "th" replaced with "f".
s = 'This is just a pithy sentence with three replacements.'



# Print the list created by taking your multiline string above and splitting it by lines.



# Print the single string "Hen3ry" created by concatenating "Hen", the value in variable "tres" converted to a string, and "ry".
tres = 3



# Compute and print the Golden Ratio (https://en.wikipedia.org/wiki/Golden_ratio) to six decimal places using a format string.




## Regular Expressions

Regular expressions are a grammar for flexibly specifying _patterns_ to be matched in character sequences.  One common application is to extract email addresses from text.  Here is a complex regular expression from [http://emailregex.com/](http://emailregex.com/) to describe the form of an email address:

```[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+```

Let's take this apart. First consider the first portion within square brackets:

```[a-zA-Z0-9_.+-]```

Square brackets are used to define a set of characters.  The range 'a-z' means what you would expect: the inclusion of all lowercase Latin alphabetic characters.
We also see all uppercase Latin alphabetic characters, all digits, the underscore, the period, the plus, and the minus.  The "+" immediately after this bracketed character set means "one or more of the preceding pattern", in this case, one or more characters from this set of options.  This username specification is followed in the pattern by the single character "@".  While the plus character is allowed in the username, we can see that it isn't allowed in the period-separated portions after the "@".
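As a quick check of the points above, the following sketch tries a character set with "+" on its own and then tests the full pattern against two made-up addresses. The sample strings and the names ```lowercase_run``` and ```email_pattern``` are assumptions chosen for illustration.

```python
import re

# A character set followed by "+" matches one or more characters from the set.
lowercase_run = re.compile(r'[a-z]+')
print(lowercase_run.findall('abc 123 Def'))  # ['abc', 'ef']

# The full email pattern from above, written as a raw string.
email_pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

# "+" is allowed in the username portion...
print(bool(email_pattern.fullmatch('first+last@example.com')))  # True
# ...but not in the portion after "@".
print(bool(email_pattern.fullmatch('user@exam+ple.com')))  # False
```

Here ```fullmatch``` requires the whole string to fit the pattern, which makes it convenient for testing a single candidate address.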

Let us now define and put this regular expression to work in Python with package "re":


In [3]:
import re

# Define email regular expression
email_regex = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')  # raw string (r'...') keeps backslashes literal

s = 'This example sentence contains email addresses for Gettysburg College Admissions <admiss@gettysburg.edu>, the Gettysburg College IT Helpdesk <trouble@gettysburg.edu>, and Faik E. Mayl <devnull@nospam.org>.'

# We can use our email regular expression to find all emails in this sentence and put them into a list:

all_emails = email_regex.findall(s)
print(all_emails)

# We can use the patterns to make new strings with substitutions for matched patterns:

s_no_emails = email_regex.sub('email@omitted.now', s)
print(s_no_emails)

# Parentheses can be used in a regular expression to define "groups", allowing us to get the matched portion for each group as tuples:
email_regex_with_username_domain_groups = re.compile(r'([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)')
print(email_regex_with_username_domain_groups.findall(s))

# We can also name our groups with "(?P<name> )" syntax and build a list of dictionaries for flexible access to our matched information.
email_regex_with_named_groups = re.compile(r'(?P<username>[a-zA-Z0-9_.+-]+)@(?P<domain>[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)')
email_dicts = [match.groupdict() for match in email_regex_with_named_groups.finditer(s)]
print(email_dicts)
print('Here comes trouble!', email_dicts[1]['username'])  # Get the username of the second match (index 1)

['admiss@gettysburg.edu', 'trouble@gettysburg.edu', 'devnull@nospam.org']
This example sentence contains email addresses for Gettysburg College Admissions <email@omitted.now>, the Gettysburg College IT Helpdesk <email@omitted.now>, and Faik E. Mayl <email@omitted.now>.
[('admiss', 'gettysburg.edu'), ('trouble', 'gettysburg.edu'), ('devnull', 'nospam.org')]
[{'username': 'admiss', 'domain': 'gettysburg.edu'}, {'username': 'trouble', 'domain': 'gettysburg.edu'}, {'username': 'devnull', 'domain': 'nospam.org'}]
Here comes trouble! trouble


# In Class

Perform the following exercises together:

## Modules and Packages

* From package "random" import functions "seed" and "gauss".
* Set the seed to 0 with "seed(0)".
* Execute the help function for "gauss".
* Create a list "data" with 1000 Gaussian random numbers with mean 2 and standard deviation 1.
* Import package "statistics".
* Use help or online documentation to help you compute and print the deciles of "data". (Your first value should be ~0.777.)

_Note: Looking up how to do something in a programming language is the norm of a programmer's experience.  Get comfortable using both help() and web searches._
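As a small sketch of looking things up from within Python, the example below uses module math rather than the functions in the exercise. The variable names ```doc``` and ```sqrt_like``` are illustrative.

```python
import math

# help(math.isclose) prints documentation interactively;
# the descriptive text is also stored in the function's __doc__ attribute.
doc = math.isclose.__doc__
print(doc.splitlines()[0])

# dir() lists the names a module defines, which helps when you
# remember only part of a function's name.
sqrt_like = [name for name in dir(math) if 'sqrt' in name]
print(sqrt_like)
```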

## String Operations


In [2]:
# Print the following string in all UPPERCASE.
s = 'This is a test sentence.'



# Print the following string with the whitespace stripped from beginning and end.
s = '       This has both leading and trailing whitespace.       '



# Print the string 'Center me.' in the center of 40 characters (other characters being filled as spaces).



# Print the list entries of the following string list that begin with 'ab'.  Hint: Use a simple list comprehension.
l = ["{0:b}".format(i).zfill(4).replace('0','a').replace('1','b') for i in range(2 ** 4)]



# Print the following phone number with dashes removed (replaced with empty strings).
s = '800-555-1212'



# Print the string created by joining list "l" above with commas between entries.



# Use a format string to print "The answer to life, the universe, and everything is 42." where the 42 is inserted from variable "answer".
answer = 42



## Regular Expressions

Below is a string "text" assigned a multiline string copy-pasted from https://www.gettysburg.edu/faculty/faculty-resource-guide.
* All phone numbers within are of the form ###-###-####, where # is a digit.  Form a Python regular expression to describe this pattern.
* Compile and use this regular expression to find all phone numbers in "text" and create a list of them.
* Print the list of phone numbers in "text".

In [3]:
text = '''Academic resources
Office	Phone Number	Description
Academic Advising	717-337-6579	General advising questions, disability accommodation questions
Registrar’s Office	717-337-6240	AP/IB credit, transfer credit, Peoplesoft/Student Center questions, Registration Holds
Musselman Library	717-337-7024	Research Assistance, Interlibrary Loans, Course Reserves for Students
Orientation, dashboard and housing resources
Office	Phone Number	Description
Office of Residential and First Year Programs	717-337-6901	Trouble with log-ins, Dashboard question, Orientation questions, Housing questions/requests
Financial resources
Office	Phone Number	Description
Financial Services	717-337-6220	Student Account, Payment Plans, Making a payment
Financial Aid	717-337-6611	Financial Aid, Academic Merit Scholarships, Talent Scholarships
Technology resources
Office	Phone Number	Description
Information Technology Services/G-Tech	717-337-7000	Assistance with personal computers, Connecting to College Network, Accessing e-mail, general trouble shooting assistance
Health, mental health and well being resources
Office	Phone Number	Description
Health Services	717-337-6970	Assessment and Treatment of Acute Illness, Management of Stable and Chronic Illness, Stress Management, Weight Management, Well Care visits, In-House Lab, Health Education
Counseling Services	717-337-6960	Free confidential counseling services, emergency services, psychiatric services, self-help resources, skills workshops,
Campus Recreation	717-337-6428	Intramurals and Recreational Sports, Club Sports, Fitness Classes
Safety resources
Office	Phone Number	Description
Department of Public Safety	717-337-6911	Safety Escorts, Crime Prevention Programming, Fire and Intrusion Alarm Monitoring and Response, Patrol of Campus, Lost and Found, Courtesy Vehicle Jump-Starts, Response to Medical and Other Emergencies
Student Rights and Responsibilities	717-337-6907	Reporting Incidents of Bias, Bias Education and Advisory Council, Addressing violations of the code of conduct
Office of Sexual Respect and Title IX	717-337-6900	Violence Prevention, Title IX Response, Sexual Misconduct Response
Social and outside of the classroom resources
Office	Phone Number	Description
Office of Student Activities and Greek Life	717-337-6304	Clubs and Organizations, Social Programming, Greek Life- for second year students, Student Senate
Office of Multicultural Engagement	717-337-6311	First-Generation Student Support, Support for Affinity Groups, Social Programming, Awareness and Heritage Month Programming, Mentoring, Mosaic Cupboard, Academic Success Workshops
Garthwait Leadership Center	717-337-8444	Leadership Development for students
Eisenhower Institute	717-337-6685	Programs in Environmental leadership, Civil Rights, Women and Leadership, etc.
Center for Public Service	717-337-6490	Community Service and Volunteer Opportunities, Immersion Projects,
Women’s and LGBTQA+Life Resource Center	717-337-6991	Safe Space to study, hang out, and relax, Meeting area for clubs and organizations, Library, Programming
Career Engagement	717-337-6616	On Campus jobs, Career Exploration Programs, Internships, Externships, Shadow Programs
Study Abroad Programs	717-337-6866	Study Abroad Programs'''



# Homework

(1) Complete any in-class exercises you did not complete in class.

(2) Do the following:
* Import package "[urllib3](https://urllib3.readthedocs.io/en/latest/user-guide.html)", a widely used Python package for making HTTP requests to interact with websites.
* Assign variable "http" to be a new "PoolManager()" object from package "urllib3".
* Assign variable "url" to be the string '[http://cs.gettysburg.edu/~tneller/ds256/data/test.html](http://cs.gettysburg.edu/~tneller/ds256/data/test.html)'.
* Execute "response = http.request('GET', url)" to get the web page source from that URL and store it in variable "response".
* Execute "print(response.data.decode('utf-8'))" to print the decoded downloaded web page source.

This ability will be foundational for "[web scraping](https://en.wikipedia.org/wiki/Web_scraping)" data from web pages and will be used in exercise 4 below.
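The ```decode('utf-8')``` step is needed because ```response.data``` holds raw bytes rather than a string. The sketch below shows the difference without making a network request; the variable ```raw``` and its contents are a hypothetical stand-in for downloaded data.

```python
# Hypothetical stand-in for response.data: raw bytes as they might arrive over the network.
raw = '<html><body><p>Café menu</p></body></html>'.encode('utf-8')

print(type(raw))            # <class 'bytes'>
text = raw.decode('utf-8')  # interpret the bytes as UTF-8 text
print(type(text))           # <class 'str'>
print(text)

# The byte count and character count can differ: "é" takes two bytes in UTF-8.
print(len(raw), len(text))
```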

(3) Follow the instructions of each comment below.

In [2]:
# Print the following string in all lowercase.
s = "WHY ARE CAPITAL LETTERS ASSOCIATED WITH YELLING?"



# Now, in addition, print the capitalized version of the lowercase string so that it appears like a properly capitalized sentence.



# Print the square root of 2 right-justified in 20 characters (the rest of the characters being spaces).



# Print the list of entries of the following string list that end with 'bb'.  Hint: Use a simple list comprehension.
l = ["{0:b}".format(i).zfill(4).replace('0','a').replace('1','b') for i in range(2 ** 4)]



# Print the list of words you get from the following sentence by splitting on the space character.
s = "This is a sentence with seven words."



# Print a single multiline string created from the following list of strings with one string per line.
l = ['This should be line 1.', 'This should be line 2.', 'This should be line 3.']



# Use a format string to print the following two lines with _____ replaced by the three entries of the list defined below.
# Amongst those interviewed were _____, ______ and _____. (without Oxford comma)
# Amongst those interviewed were _____, ______, and _____. (with Oxford comma)
entries = ['Merle Haggard’s two ex-wives', 'Kris Kristofferson', 'Robert Duvall']
# Optional: Using *entries as a parameter to the format function causes each list entry to be interpreted as a separate parameter to format.



(4) For this exercise, you will make use of exercise 2 and scrape some simple data from Gettysburg College's page on courses satisfying curricular requirements: [https://www.gettysburg.edu/offices/registrar/courses-fulfilling-the-gettysburg-curriculum](https://www.gettysburg.edu/offices/registrar/courses-fulfilling-the-gettysburg-curriculum).
* Make an HTTP GET request of URL 'https://www.gettysburg.edu/offices/registrar/courses-fulfilling-the-gettysburg-curriculum'
* Instead of printing the decoded data, assign it to string ```s```.
* There is a lot of page metadata that could generate false matches when searching for course numbers.  We know that what we want is between the strings defined below as "start_text" and "end_text".  Find the indices of ```s``` where each of these is found. Then reassign ```s``` to be the substring of ```s``` from the start index to the end index.
* Course numbers seem to follow a pattern of 3 digits, optionally followed by a dash and one more digit.  Compile regular expression ```r'\D(\d{3}(?:-\d)?)\D'``` and assign the compiled regular expression to variable ```course_num_regex```.  (Note: \D means any non-digit, \d means any digit, {3} means exactly three of what precedes it, ? means what precedes it is optional, and ?: at the beginning of a parenthesized group makes the group non-capturing, so it is not returned as a separate result.)
* Perform a ```findall``` of this regular expression on ```s``` to get the list of the page's course numbers and then print that list.
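To see how the pieces of this pattern behave before applying it to the real page, here is a minimal sketch on a made-up sentence. The string ```sample``` and the names ```course_like``` and ```with_capturing``` are assumptions for illustration and are not taken from the course page.

```python
import re

# Made-up sample text for illustration only.
sample = 'Sample: CS 111, HIST 204-5, and room 12.'

# The outer parentheses capture the number; (?:...) groups the optional
# "-digit" suffix without capturing it separately.
course_like = re.compile(r'\D(\d{3}(?:-\d)?)\D')
print(course_like.findall(sample))  # ['111', '204-5']

# With an ordinary (capturing) inner group, findall returns a tuple per match instead.
with_capturing = re.compile(r'\D(\d{3}(-\d)?)\D')
print(with_capturing.findall(sample))  # [('111', ''), ('204-5', '-5')]
```

The two-digit "12" is skipped because the pattern requires exactly three digits between non-digit characters.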

In [1]:
start_text = 'A. 32 Course Units'
end_text = 'Individualized Study courses and Internships may not be used to fulfill Curricular Goals.'
# Uncomment the next line to disable the warning once you have imported urllib3:
# urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)



(end of homework)