Setting up a Python Analytics Server
Suhas Somnath
Advanced Data and Workflows Group
National Center for Computational Sciences
Oak Ridge National Laboratory
10/9/2017
Table of contents:
- Setting up a Python Analytics Server
- Table of contents:
- Introduction:
- Configuration
- Step 0: Getting a CADES Cloud account:
- Step 1: Creating and Launching an instance:
- Step 2: Accessing the Instance:
- 1. Find the IP address of your machine
- 2. Get the public SSH key
- ORNL Mac / Linux computer:
- ORNL Windows computer:
- From your personal computer:
- Step 3: Installing analytics packages on the instance:
- Running
Introduction:
- R and Python are two of the most popular languages for analyzing scientific data. However, it can be challenging for first-time users to set up these familiar languages on cloud computing resources for data analytics.
- These self-service instructions will guide you through the process of creating a virtual machine (VM) on the CADES Cloud (comparable to a powerful, scalable desktop computer) that you could use instead of your personal computer for data analysis via a Jupyter notebook server.
- The entire setup process (besides step 0) should take about 20 minutes. Once set up, connecting to the notebook server should only take a few button clicks.
Support:
- CADES provides the ability and support to create and use virtual machines. Users are free to use such VMs for a variety of purposes, such as running a Jupyter notebook server. Users are responsible for maintaining the software installed on their own VMs (e.g. - python packages, Jupyter server, etc.).
- Please follow all steps in this guide to ensure a smooth setup of your analytics VM. For questions regarding the virtual machine itself (steps 0-2), please contact CADES support. If you have any questions regarding the setup of anaconda, Jupyter, etc. (steps 3-6) please feel free to contact me.
Best Practices and ethical use of the cloud:
A virtual machine is like a public-use desktop or laptop. It costs money to run VMs, and reserving resources for your VM prevents others from using them. Here are a few guidelines for using and managing VMs:
- If you are not using the VM for extended periods or are not using all the horsepower, consider resizing to a smaller flavor (fewer CPU cores and smaller RAM). Remember that you can always resize it back to something bigger whenever necessary.
- Additionally, you can shut down your VM, just like you would shut down your personal computer.
- Consider deleting VMs you will never use again.
Other notes:
- The remote machine runs the Ubuntu (Linux) operating system. It is recommended to take this short tutorial if you are new to Linux and/or the command line interface.
- CADES has several helpful guides on learning the basics of Linux as well.
- This document is limited to the instructions necessary for setting up a virtual machine for data analytics using python and is not intended to serve as a comprehensive manual for maintaining and administering your virtual Linux machine or implementing other advanced analytics features such as JupyterHub. Please refer to other online resources for such topics.
- Though you can set up a VM for analytics using R, this is NOT extendable to Matlab and similar proprietary / paid software packages.
- This virtual machine will only be accessible within the ORNL network. You would need to use the VPN on an ORNL laptop when off campus or access the machine via Citrix on your personal computer.
- Many thanks to Chris Layton, Pete Eby, and Ketan Maheshwari from CADES and Ondrej Dyck from CNMS.
Configuration
Step 0: Getting a CADES Cloud account:
- You will need to request access to the CADES Cloud by following the instructions here. You should receive an email within a few minutes to 1-2 hours regarding the approval of your request.
- OPTIONAL: By default, everyone has access to virtual machines that have up to 8 GB of memory and 16 CPU cores. If you need more, you can request to have your quotas increased by contacting CADES. Include details such as your three-character `UCAMS id`, justification, and duration for the increase in quota in the email.
- OPTIONAL: Consider joining the `#ornl_cloud` channel on the CADES Slack group to communicate with other users of the CADES cloud.
Step 1: Creating and Launching an instance:
You can follow the four steps in CADES’ documentation in the links below but pay attention to the following notes:
- Log in to Horizon, name your VM - follow the instructions on this page as is.
- Choose a flavor, image, and boot source - follow instructions here but pay attention to a few things:
  - At the `Source Tab`:
    - `Delete Volume on Instance Delete`: Set to `No` if you want the drive to be kept alive even though the instance is deleted. This is generally a good idea - you can always delete the volume (after you delete the instance) if you don't need it.
    - `Volume Size`: This is the size of the storage drive that will contain the operating system, data, python packages etc. You are recommended to set this to `16 GB` or larger. If you intend to use your CADES Cloud account exclusively for this analytics server, you can use up your entire quota (e.g. 40 GB). Like any personal computer, you can always add volumes to your instance but starting off with a large enough volume can mitigate additional work. Please see this document if you already created an instance but need to add a new storage volume.
  - At the `Flavor Tab`: This mainly determines the number of processor cores and memory. You can change the flavor after creating the instance so do not worry about this step very much. Pick the flavor that best suits your applications:
    - Pick any flavor that begins with `m1.` if you do a lot of statistical analysis that requires a large RAM compared to the number of CPU cores.
    - Pick any flavor beginning with `c1.` if you tend to run a lot of small computations in parallel.
    - For additional flavors request CADES to increase your quota. See Step 0.
    - You can always run multiple machines in parallel. So you could distribute your memory / CPUs among two machines that fully utilize your quota.
- Set up a security group as it says in the document.
- Configure a key pair for accessing the VM as it says in the document.
Step 2: Accessing the Instance:
The instructions below are a simplification of the official CADES documentation:
- For Mac / Linux
- For Windows
1. Find the IP address of your machine
- While in the `Horizon` interface you used for creating the instance, click on the `Compute` tab, then the `Instances` sub-tab.
- Copy the `IP address` listed for your instance.
2. Get the public SSH key
- Click on the `Access and Security` tab and then navigate to `Key Pairs`.
- Click on the key. In this case: `CADESCloudKey`
- Copy the contents of `Public Key` and paste into a text editor like `TextEdit` on macOS or `Notepad` in Windows. Read the next step before saving:
- Before saving, make sure to change the format to `plain text`. This is especially true of `TextEdit` on Mac (in the menu bar, go to `Format` -> `Make Plain Text`) and `Wordpad` in Windows (when saving, select `Text Document (.txt)` instead of the default `Rich text` in the pull-down menu).
- Save the file as `id_rsa.pub`
From here on follow instructions specific to your operating system:
ORNL Mac / Linux computer:
Before you begin: These instructions are for ORNL computers only. Instructions for personal computers will follow. If you are outside the ORNL network but working on an ORNL computer, you will need to connect to the ORNL VPN using your PIN and RSA token to get back into the network.
1. Moving the keys:
- OPTIONAL but Recommended: If you are interested in accessing your instance from your personal computer, it is recommended to make a copy of your public and private keys and place the copies someplace on `ORNLDATA` (e.g. - My Documents).
- Open the `Terminal` application and navigate to the directory where you stored your private key.
- Rename your private key from the original name (for example - CADESCloudKey) by typing:
$ mv CADESCloudKey id_rsa
- Move the private and public keys to `~/.ssh/`. For example, if you stored both the private and public keys in Documents:
$ cd Documents
$ mv id_rsa ~/.ssh/id_rsa
$ mv id_rsa.pub ~/.ssh/id_rsa.pub
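OpenSSH generally refuses to use a private key whose file permissions let other users read it, so it can help to tighten permissions once the keys are in place. The following is a minimal sketch and not part of the CADES instructions: `secure_ssh_keys` is a hypothetical helper name, and it assumes the keys are named `id_rsa` and `id_rsa.pub` as above.

```shell
# Hypothetical helper (not from the CADES documentation):
# restrict permissions on an SSH directory and the key pair inside it.
# Usage: secure_ssh_keys DIRECTORY
secure_ssh_keys() {
    ssh_dir="$1"
    chmod 700 "$ssh_dir"             # only you can enter the directory
    chmod 600 "$ssh_dir/id_rsa"      # private key: readable and writable by you only
    chmod 644 "$ssh_dir/id_rsa.pub"  # public key: safe for others to read
}
```

After moving the keys, you would call it as `secure_ssh_keys ~/.ssh`.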
2. OPTIONAL: Shortcuts!
Aliases:
You can set up aliases that make it easier to refer to your remote machine. Aliases can turn commands like `ssh cades@172.22.3.50` into something far simpler like `ssh jupyterVM`.
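One common way to create such an alias is an entry in the OpenSSH client configuration file, `~/.ssh/config`. The sketch below is only an illustration: `add_ssh_alias` is a hypothetical helper, and the alias name, IP address, and user name are the example values used in this guide.

```shell
# Hypothetical helper: append a host alias to an OpenSSH client config file.
# Usage: add_ssh_alias CONFIG_FILE ALIAS IP_ADDRESS USER
add_ssh_alias() {
    config_file="$1"
    alias_name="$2"
    host_ip="$3"
    user_name="$4"
    mkdir -p "$(dirname "$config_file")"
    # Unquoted EOF: the $variables are expanded here; "~" is left for ssh to expand.
    cat >> "$config_file" <<EOF
Host $alias_name
    HostName $host_ip
    User $user_name
    IdentityFile ~/.ssh/id_rsa
EOF
    chmod 600 "$config_file"   # keep the config private to your account
}
```

For example, after running `add_ssh_alias ~/.ssh/config jupyterVM 172.22.3.50 cades`, typing `ssh jupyterVM` should behave like `ssh cades@172.22.3.50`.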
Graphical interface for SSH:
The Mac `Terminal` application comes with utilities that simplify the ssh process with a graphical interface. If you are comfortable with the command line and do not mind typing `ssh` / `sftp` commands you can skip this step.
If you are interested in this quick setup, follow the instructions here. Please only follow instructions till step 6 (set up the entries and do not follow any steps including and following those that expect you to click on the `Connect` button. We will get to this in `Step 4` below).
3. Connecting to the instance
- Via the command line interface on the `Terminal` app:
$ ssh cades@172.22.3.50
- Via the graphical interface of the Mac `Terminal` app:
  - Open the `Terminal` app.
  - Go to `Shell` → `New Remote Connection`
  - Ensure that `Secure Shell (ssh)` is selected on the left-hand column, then select the first entry you made (`cades@172.22.3.50` in my case) in the right-hand column, and click on the `Connect` button in the bottom right.
ORNL Windows computer:
Before you begin: These instructions are for ORNL computers only. Instructions for personal computers will follow. If you are outside the ORNL network but working on an ORNL computer, you will need to connect to the ORNL VPN using your PIN and RSA token to get back into the network.
- Install PuTTY: PuTTY should be preinstalled on all ORNL Windows computers. However, if you don’t have PuTTY installed, install it via the following links:
- OPTIONAL but Recommended: If you are interested in accessing your instance from your personal computer, you are recommended to make a copy of your public and private keys and place the copies someplace on `ORNLDATA`.
- Configure PuTTY to connect to your instance by following the instructions starting from the topic titled `Connect to Your VM Instance Using PuTTY` in CADES' instructions.
- Configure the tunneling to connect to the Jupyter notebook server by following the instructions here.
From your personal computer:
- Log in via the Citrix page
- `PuTTY` setup:
  - Select the `ORNL General Desktop` application.
  - Follow steps 2-4 in the instructions laid out for ORNL Windows computers above.
- You can access your VM through at least two routes:
  - Recommended: In the `Citrix` menu, select the `PuTTY` application and use it as you would use an ORNL Windows computer.
  - In `Citrix`, select the `ORNL General Desktop` application and use the `PuTTY` application to access your VM. This may be slow (bandwidth wasted on transporting the bits of the Windows virtual machine) and tedious (you cannot forward the Jupyter notebook server to your personal computer - it would stay within the Windows virtual machine). This option is preferable in the event that you want to upload data / code from your `ORNLDATA` to your VM.
Step 3: Installing analytics packages on the instance:
- Download Anaconda 5.2 -> python 3.6. You can download a different version if you wish.
$ mkdir temp
$ curl https://repo.continuum.io/archive/Anaconda3-5.2.0-Linux-x86_64.sh > temp/Anaconda3-5.2.0-Linux-x86_64.sh
- Change privileges before installing Anaconda
$ chmod +x temp/Anaconda3-5.2.0-Linux-x86_64.sh
- Install Anaconda:
- Start the installer
$ bash temp/Anaconda3-5.2.0-Linux-x86_64.sh
- Follow the prompts to install Anaconda. Accept the license, say yes to installing to the default location, say yes to prepending anaconda to the path.
- Delete temporary installation folder:
$ rm -r temp
- Switch to anaconda environment:
$ source ~/.bashrc
- Install missing packages for wholesome Jupyter functionality:
  - Enable ability to export to pdf in Jupyter:
  $ conda install -c anaconda-nb-extensions nbbrowserpdf
  - Enable javascript for interactive elements in Jupyter:
  $ jupyter nbextension enable --py --sys-prefix widgetsnbextension
- OPTIONAL: To simplify the command to start up the Jupyter notebook:
1. First create the configuration file:
$ jupyter notebook --generate-config
2. Open the configuration file:
$ nano ~/.jupyter/jupyter_notebook_config.py
3. Use the key combination `Ctrl`+`W` to search for `open_browser`
4. Uncomment the line
5. Set the flag to `False`
6. Search for `NotebookApp.port = 8888` using `Ctrl`+`W`
7. Uncomment the line
8. Set the `port` number to `8889` (or any number > 1024 for that matter)
9. Close the editor with `Ctrl`+`X`
10. When prompted, press `Y` and then `Enter` to save the file
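The manual edits above amount to setting two options in the configuration file. As a hedged alternative to editing in `nano`, the sketch below appends both settings to the end of the file instead of uncommenting the existing lines. `configure_jupyter` is a hypothetical helper name, and it assumes the classic notebook options `c.NotebookApp.open_browser` and `c.NotebookApp.port`, as used by the notebook version shipped with the Anaconda release installed above.

```shell
# Hypothetical helper: append the two settings described above to a
# Jupyter notebook configuration file.
# Usage: configure_jupyter CONFIG_FILE PORT
configure_jupyter() {
    config_file="$1"
    port="$2"
    # Unquoted EOF so that $port is expanded into the file.
    cat >> "$config_file" <<EOF
c.NotebookApp.open_browser = False
c.NotebookApp.port = $port
EOF
}
```

You would call it once, after generating the configuration file, as `configure_jupyter ~/.jupyter/jupyter_notebook_config.py 8889`.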
- OPTIONAL - You can always install any python packages from this point on. You could install deep-learning frameworks like Keras or TensorFlow but you are recommended to use optimized Docker containers for this. Please refer to this separate tutorial for this.
Running
Step 1: Starting a Jupyter server:
- Ensure that you don’t leave room for accidental damage to the rest of the VM (such as the anaconda folder etc.) by starting the Jupyter notebook in a new / separate folder. Perhaps this folder contains data + notebooks, etc. For now, we will make an empty folder and start the notebook from there:
$ mkdir workspace
$ cd workspace
- OPTIONAL - Persistent Jupyter server: As it stands, if you close this ssh session, your command or operation (for example, a running Jupyter server) will be aborted as well. In order to keep the Jupyter server easily accessible, we will need to use either the `screen` or the `tmux` command. We will be using `screen` here. Note that this approach does not keep your ssh connection to the Jupyter server (discussed in the next step) alive if your local computer goes to sleep or is shut down. If you need your computation / analysis to run even after you shut down your local machine, you are recommended to run your analysis as a script on the remote machine instead of using Jupyter notebooks. If you decide to use `screen`, type the following command BEFORE you initiate the Jupyter server:
$ screen
- Starting the Jupyter server:
- If you modified the configuration file that was optional in the previous step:
$ jupyter notebook
- If you did NOT follow the optional instructions, specify the port and no-browser flags
$ jupyter notebook --no-browser --port=8889
- OPTIONAL: If you ran the notebook with screen:
  - You can now `detach` the screen using the key sequence: `Ctrl`+`A`, `Ctrl`+`d`.
  - You can now close the ssh session to the remote machine. This will NOT close your Jupyter server.
  $ exit
Step 2: Accessing the Jupyter server:
Mac / Linux:
Connection in the Mac Terminal app:
1. Open the Terminal.
2. Depending on which method you prefer (and have set up):
   - Command line interface:
   $ ssh -N -L localhost:8889:localhost:8889 cades@172.22.3.50
   - Graphical Interface: see [this document](tunnelling-remote-server.md#mac-access).
3. Open a browser (Chrome is recommended for interactive widgets) and go to: http://localhost:8889/.
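The tunnelling command above forwards a port on your local computer to the same port on the VM. The following sketch only assembles and prints that command so that each piece is visible. `tunnel_command` is a hypothetical helper, and the port, user name, and IP address are the example values from this guide.

```shell
# Print the ssh command that forwards a local port to the same port on the VM.
#   -N  do not run a remote command (the connection only carries the tunnel)
#   -L  local_address:local_port:remote_address:remote_port
#       (the second "localhost" is resolved on the VM, so it means the VM itself)
# Usage: tunnel_command PORT USER IP_ADDRESS
tunnel_command() {
    port="$1"
    user_name="$2"
    host_ip="$3"
    echo "ssh -N -L localhost:${port}:localhost:${port} ${user_name}@${host_ip}"
}

tunnel_command 8889 cades 172.22.3.50
```

If you chose a different port for the Jupyter server, pass that number as the first argument and use the same number in the browser address.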
Windows:
- Close any open PuTTY connections to the VM.
- Open PuTTY, load the configurations for your machine and connect. You will be presented with a new SSH connection to the VM. You can close this if you do not need it.
- Open a browser (Chrome is recommended for interactive widgets) and go to: http://localhost:8889/.
Personal computer:
- Log in via the ORNL Citrix page.
- Select the `PuTTY` application.
- Follow the same instructions for Windows computers.
Step 3: Shutting down the Jupyter server:
Once you are done working on your Jupyter server, you will need to:
- If you used `screen` and closed your SSH connection to your virtual machine where you initiated the Jupyter server, SSH into your virtual machine:
1. Windows – use your saved PuTTY profile
2. Mac / Linux: Use either the command line or graphical interface described in Step 2. For the command line interface - open the terminal and replace with your IP address:
$ ssh cades@172.22.3.50
At this point, you should either have access to an existing SSH connection to the remote machine or you should have created a new connection in the preceding step.
- If you used `screen`, re-attach the screen where your Jupyter notebook server is running by typing:
$ screen -r
You should be seeing the print logs of the Jupyter server on the remote machine now.
- Press `Ctrl`+`C` twice to shut down the Jupyter server as you normally would on your local machine.