WorkFlow and Methodology

Overview

  • The following map summarizes my general WorkFlow for data analysis based on Python, a popular, community-driven, and user-friendly language.

  • I hope the WorkFlow and Methodology below can also serve as a reference for other languages/projects where Python is useful as a prototyping/demonstration tool.



  • This is a work in progress and will be aggregated into a talk about Open Research. My idea of open research is adapted from the computational thinking of the open-source community.



Some of the basic concepts include:

  • continuous integration: test-driven and object-oriented for easier maintenance.
  • well documented: including reports/web pages/presentations for better understanding.
  • reproducibility: a containerized development environment, archived data, and a git-logged code repository should be accessible for replication.
  • community driven: publicly ready to communicate via peer review and general feedback.

Drafting

Outline: MindNode

Use mind map to conveniently remark and organize the outline of a project.


Content type in MindNode

  • image
  • smart link with http://
  • note
  • task

Extend: export to Markdown

Export the mind map to markdown document to extend the details on each topic.


Markdown Basics: the key formatting syntax. Markdown is also compatible with HTML markup most of the time.

<!-- image with style -->
[<img src="img_url" style="float: right" width="100px">](img_url)

<!-- cross reference -->
TOC
- [title](#tag)
<a name="tag">title</a>

<!-- footnote -->
<sup>[1](#myfootnote1)</sup>
<a name="myfootnote1">1</a>: Footnote content goes here

Developing

Environment: Docker

Containerize: balance system isolation and performance, like a sandbox for microservices.

Basic concepts

  • Images - The blueprints of our application, which form the basis of containers. The docker pull command downloads an image (e.g., busybox) from a registry.

  • Containers - Created from Docker images; they run the actual application. We create a container using docker run. A list of running containers can be seen using the docker ps command.
  • Docker Daemon - The background service running on the host that manages building, running, and distributing Docker containers. The daemon is the process running in the operating system that clients talk to.

  • Docker Client - The command line tool that allows the user to interact with the daemon. More generally, there can be other forms of clients too, such as Kitematic, which provides a GUI to the user.
  • Docker Hub - A registry of Docker images. You can think of the registry as a directory of all available Docker images. If required, one can host their own Docker registries and can use them for pulling images.

1. Start from base image

Basic commands

docker pull image_name  # pull public image/repository from a registry

docker build [--no-cache] -t image_name path/to/context [-f path/to/Dockerfile]

docker run -it  # interactive
           --rm  # rm container when exit
           -d  # run as detached
           -p 8888:8888  # port fwd to host
           -e DISPLAY=$DISPLAY  # set environment variable
           -u user  # username/uid in image
           -v path/to/local:path/to/container  # mount directory
           image_name
           [command]

docker port container_id  # show the open ports of a container instance
docker start/attach/stop/rm container_id  # manage a container instance
docker rmi image_id  # remove an image
  • public registry: be cautious about potential security risks in third-party images/repositories.

2. Record needed ingredients in requirements.txt

While developing, record the additional Python packages in a text file named requirements.txt, which will be useful for constructing the Dockerfile that automatically configures the development environment, as well as for hosting an interactive Jupyter notebook with mybinder.


# Example requirements.txt
SomeProject
SomeProject == 1.3
SomeProject >= 1.2, < 2.0
SomeProject[foo, bar]
SomeProject ~= 1.4.2

3. Compose building recipe to Dockerfile

Some resources


Steps

  1. start with a base image that meets the need of the project
  2. customize with additional dependencies using RUN pip install -r requirements.txt
  3. add special system setup and port configuration as needed

Prototyping: Python

Some principles

  • Object-oriented programming
    • function -> method
    • variable -> attribute
  • Computational thinking: see the full image here

A few suggestions

  • think about goals of code
  • break into reasonable classes
  • pseudocode it up
  • check for efficient algorithm
  • test each function and class
  • assemble code in main
  • run
  • optimize if necessary
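As a toy end-to-end illustration of these steps (all names here are hypothetical, not from the original), a small script might look like:

```python
class RunningMean:
    """Toy class: track the mean of a stream of numbers."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, value):
        """Fold one value into the running statistics."""
        self.total += value
        self.count += 1

    def mean(self):
        """Return the current mean (0.0 for an empty stream)."""
        return self.total / self.count if self.count else 0.0


def main():
    # assemble the pieces in main, then run
    rm = RunningMean()
    for x in [1.0, 2.0, 3.0]:
        rm.add(x)
    return rm.mean()


if __name__ == "__main__":
    print(main())  # 2.0
```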

Code Styling: PEP8

  • indentation: 4 spaces
  • snake_case: packages, modules, functions, variables, attributes
  • CamelCase: classes
  • lowercase names for variables; add a trailing _ to avoid shadowing built-in names
  • spaces around operators, except in keyword arguments like func(a=1, b=3), and optionally none around higher-priority operators, as in c = a/b
  • separate imports, one per line
  • inline comments: two spaces before #, one space after; use docstrings
  • break long lines with \ or implied continuation inside parentheses
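A short snippet illustrating a few of these conventions (the names are illustrative only):

```python
MAX_RETRIES = 3  # module-level constant


class DataLoader:  # class names use CamelCase
    """Toy class following PEP8 naming."""

    def clean_name(self, raw_name):  # functions/variables use snake_case
        stripped = raw_name.strip()  # spaces around the assignment operator
        return stripped


def add(a, b=0):  # no spaces around '=' for keyword/default arguments
    total = a + b  # two spaces before an inline comment, one after '#'
    return total
```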

Practical steps

1. Dev with Jupyter, note issues: the interactive notebook is very handy for development


2. Aggregate to Python script: modularize code into functions

More to read and adopt


3. Checkpoint scripts with git: git log the progress

  • Github git cheat sheet: some basic operations

  • Create a private git repository on any ssh server with 6 lines

# on server
mkdir project.git
cd project.git
git --bare init

# client side
git init
git remote add mygitserver ssh://git@remote-host[:port]/path/to/project.git
git push mygitserver master

Packaging: setuptools

When the code is ready to share,

  1. create a parent directory
  2. create setup script
  3. create MANIFEST.in for list of files
    • include README
    • include LICENSE
    • include setup.py
    • recursive-include folders scripts
  4. python setup.py sdist
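A minimal setup.py sketch for step 2 (the package name, version, and dependency are placeholders, not from the original):

```python
# setup.py -- minimal sketch; metadata values are placeholders
from setuptools import setup, find_packages

setup(
    name="myproject",
    version="0.1.0",
    description="Example package",
    packages=find_packages(),
    install_requires=["numpy"],  # mirror requirements.txt here
)
```

With this in the parent directory, python setup.py sdist produces a source distribution under dist/.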

1. Modularize function: if not done earlier

class CamelCase:
    def __init__(self):  # initialize attributes
        self.attribute = ...

    def __repr__(self):  # printable representation
        return "..."

if __name__ == '__main__':
    pass  # body of program

2. Unittest: remember the issues we noted down during development? These are good cases to write tests for. A more proactive approach is test-driven development.

├── __init__.py
├── code.py
├── func_a.py
├── func_b.py
├── func_c.py
└── tests
    ├── __init__.py
    ├── test_funcs.py
    └── test_something.py

testing framework

  • pytest
    • conda install pytest pytest-cov
    • compose tests under tests/
    • run py.test
  • nose
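For example, a hypothetical tests/test_funcs.py could exercise a small helper:

```python
# tests/test_funcs.py -- hypothetical pytest example
def snake_to_camel(name):
    """Toy function under test: convert snake_case to CamelCase."""
    return "".join(part.capitalize() for part in name.split("_"))


def test_snake_to_camel():
    assert snake_to_camel("my_class_name") == "MyClassName"


def test_empty_string():
    assert snake_to_camel("") == ""
```

Running py.test from the package root discovers test_*.py files under tests/ automatically.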

3. Continuous Integration: use continuous integration to test automatically whenever something changes in the repository.

  • Travis: require .travis.yml file to config
  • circleci
  • AppVeyor

# example .travis.yml file

language: python
python:
  - "3.6"
# command to install dependencies
install: "pip install -r requirements.txt"
# command to run tests
script: pytest

4. Profiling & Optimization

Premature optimization is the root of all evil. -- Donald Knuth

  • computation amount
  • memory usage
  • input/output
  • storage

Tips from Cameron Hummels

  • Decide what you are optimizing over
  • Computer time versus person time
  • Write readable code first, then optimize
  • Use profilers to identify bottlenecks
  • Address bottlenecks one at a time
  • Latest Python is most optimized
  • Try new approaches and profile/test it

  • Object oriented code
  • NumPy arrays are optimized
  • Vectorize loops when possible
  • List comprehensions
  • Avoid building lists through appends
  • In place operations as opposed to rebuilding
  • Cython
  • Numba

time  # coarse total time
%%timeit  # code snippets
cProfile  # deep profiling with viz tools
  pstats  # text-based
  snakeviz
  runsnakerun  # pipeline tool
  pyprof2html  # html tool
line_profiler/kernprof  # line-by-line function
memory_profiler  # memory consumption and memory leaks
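A small stdlib-only sketch of this workflow: time a snippet coarsely with timeit, then profile it with cProfile and print a text report via pstats (the profiled function is a made-up example):

```python
import cProfile
import io
import pstats
import timeit


def slow_sum(n):
    """Deliberately naive loop to profile."""
    total = 0
    for i in range(n):
        total += i
    return total


# coarse timing of a code snippet
elapsed = timeit.timeit("slow_sum(10000)", globals=globals(), number=100)

# deep profiling with a text-based pstats report
profiler = cProfile.Profile()
profiler.enable()
slow_sum(10000)
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
```

For line-by-line or memory views, swap in line_profiler or memory_profiler as listed above.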

Parallel computing

  • multithreaded
  • multiprocessing
  • MPI and mpi4py
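A minimal multiprocessing sketch (the worker function is a placeholder):

```python
from multiprocessing import Pool


def square(x):
    """Placeholder CPU-bound worker."""
    return x * x


if __name__ == "__main__":
    # distribute the work across two processes
    with Pool(processes=2) as pool:
        results = pool.map(square, range(5))
    print(results)  # [0, 1, 4, 9, 16]
```

multiprocessing sidesteps the GIL for CPU-bound work; mpi4py extends the same idea across cluster nodes.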

5. Documenting: essential for future revisit or further development.

versioning: x.y.z (e.g., 0.2.3, 2.7.12, 3.6)

  • change x for breaking changes
  • change y for non-breaking changes
  • change z for bug fixes
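As a quick sketch (not a full PEP 440 parser), x.y.z strings can be compared as integer tuples:

```python
def parse_version(version):
    """Split an x.y.z string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


assert parse_version("2.7.12") > parse_version("2.7.9")  # bug-fix bump
assert parse_version("0.2.3") < parse_version("3.6")     # breaking change
```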


docstring and comments tips

  • document while coding
  • document the interfaces of modules
  • use descriptive names
  • consistent style/format
  • docstring

Publishing

Presentation

Following my WorkFlow, most of the work has been done by this stage. The rest can be carried out with very minimal effort and a decent finish.


Example finish: use ? for keyboard shortcuts to control the slides.


1. MindNode -> Markdown: re-arrange and convert the outline mind map to markdown.


Warning: the Jupyter notebook should be re-organized for presentation, especially for a dissertation defense! The order of work is not necessarily the order of the talk! Check my LSST talk for some tips.


2. Markdown -> HTML: extend the details in markdown and convert to html.

Pandoc: powerful tool for conversion.

  • install: brew install pandoc
  • use % for frontpage info
  • usage demo: pandoc -s --mathjax -i -t revealjs --slide-level=2 WorkFlow.md -o WorkFlow.html

3. Slideshow: HTML + reveal.js

  • add -V revealjs-url=http://lab.hakim.se/reveal-js when using pandoc to convert
  • or download reveal.js to the same directory as the converted HTML file

  • Now it's ready to open the html file to start the slideshow, use ? for keyboard shortcuts to control the slides.


Workshop

To take the research/project to a workshop, we need to recall what we've done.

1. Config env.: Dockerfile

With everything done, it is now easy to put all the ingredients and recipe together into a Dockerfile.


# Example Dockerfile for python
cat > Dockerfile <<EOF
FROM python:3

WORKDIR /usr/src/app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD [ "python", "./your-daemon-or-script.py" ]
EOF

# build
docker build -t pydev .
docker run -it --rm pydev

2. Demo: Markdown + Scripts -> Jupyter notebook

Following the mind map and the resulting markdown file, we can put the outline structure into a Jupyter notebook, since it natively supports markdown-formatted cells. We can fill in the function calls and visualization code in between.


My secret on toggling code cells

  • nbextension
    • codefolding: $ jupyter nbextension enable codefolding/main
    • more extensions: spellchecker, Table of Contents, Autoscroll, ...
  • nbconvert with --template template.tpl
    • hide/remove code in one step templates
    • more functionality available

Another wheel to edit slide styles on the fly

  • RISE: good for dev./workshop
    • demo: interactive live rendering, based on reveal.js
  • reveal.js: good for prod./presentation
    • post-process with nbconvert

3. Slideshow: nbconvert + reveal.js

Wrap up commands to convert notebook into slideshow

jupyter nbconvert notebook.ipynb --to slides --reveal-prefix reveal.js [--template hidecode/rmcode.tpl]

wget -O - https://github.com/hakimel/reveal.js/archive/3.6.0.tar.gz | tar -xvzf -

open notebook.slides.html

Repository: GitHub

When it is ready to take the project to the public, there are a few wheels very handy to make it more appealing.


Live slideshow: add some markup to the URL of the HTML file in the repository to render the slideshow live; not always working.

reveal html files: go to http://htmlpreview.github.io/?+git_html_url+?print-pdf


Live notebook demo: binder

Everything is ready; just paste the repository link into mybinder.org

  • configure: add requirements.txt or a Dockerfile
  • launch on binder: recall the RISE demo

  • static option: open http://nbviewer.jupyter.org/ and paste the github url of the ipynb file


Project webpage: github.io + HTML

github.io: use any of the converted html file to set it up in 3 steps

  • create a repository with username.github.io
  • add HTML file
  • go to https://username.github.io

Documentation: Read the Docs

This step depends on how often and how well the project is documented. If the earlier guidance is followed, there is no pain at all.

  • sphinx with reST: sphinx-quickstart -a "Name" -p Repo -v 0.1 --ext-autodoc -q
  • doxygen
  • readthedocs: link github & auto build

Building test: Travis CI

Follow the manual/documentation!


tricks

import pandas as pd

# load a large dataset in chunks to bound memory usage ("data.csv" is a placeholder path)
sources = pd.read_csv("data.csv", chunksize=100000)
for i, chunk in enumerate(sources):
    ...
Published: Sat 14 July 2018.
Updated: Tue 14 August 2018. By Dongming Jin in
