Day68: csv,conf,pdx (day2)

Posted by csiu on May 3, 2017 | with: 100daysofcode
  by Santina Lin (in Portland, csv,conf,v3)

Today is the second (and final) day of the CSV Conference in Portland and the following are notes I’ve taken regarding the talks I’ve attended:

  • Contexts & Institutions (Keynote)
  • Emoji Data Science
  • Trends in Data in Time and Space
  • Using Web Archives to Enrich the Live Web Experience Through Storytelling
  • Introducing d3.express, IDE for D3 (Keynote)
  • Opinionated Analysis Development
  • Applying software engineering to data analysis
  • Dat (Data tables)
  • Javascript for numeric computing


Keynote | Contexts & Institutions

Saving and indexing research quality federal environmental and climate data.

Laurie Allen

  • Laurie is a librarian involved with data rescue and a “rogue scientist” working to save climate data from Trump
  • DataRefuge
    • formed over the concerns about the disappearance of research quality (“real”) federal environmental and climate data. The risk is real.
    • The goal is to save copies of federal data
    • Would like to see events regarding designated communities (to identify data in communities that should be saved and to save it)
  • How do you slice and dice federal information?
    • There is no list
    • You can organize by data format & size, agency that produced the data, covered by federal records laws, or most valuable & vulnerable first
    • No perfect way to slice and dice federal information
    • Types of vulnerabilities: legal, technical, political, use & uniqueness of the data
  • Event paths
    • Seed & sort webpages (from WayBack Machine)
    • DataRefuge: harvest, check, bag, describe the data
    • Storytelling (one way of saving data is to understand the data)
    • long trail (what to do in future to make it more sustainable)
  • Codification: spreadsheets -> workflow -> web app -> system
    • Still work to be done, e.g. the descriptions do not make it quite clear what the data is for
    • Shout out to Environmental Data & Governance Initiative (EDGI)
  • Dat Project & notion of Frictionless data
  • Multiple meaning of metadata
    • Metadata, paradata (the universe of things one needs to know to work with data) and software preservation
    • When data is given to another person, there are assumptions in place about what the other person knows
    • “Metadata is a love letter to the future” – Rachel Appel who was quoting someone else
  • Institutions have lots of barriers
    • But we need to build communities within and across institutions
  • Federal deposit repositories are for when federal priorities change (they always do)
  • Miss Adelaide R Hasse Librarian, Office of Public Documents
  • data ~ stories; data and stories are deeply connected
  • Q&A
    • The U.S. produces massive amounts of data, which is used by international collaborations/groups
    • How to make the internet better at not destroying things? Hard problem. Sending an HTML file is probably not the way to go.

Emoji Data Science

Telling stories from emoji data

Emoji data science & journalism – Hamdan Azhar

Analyzing geolocation data

Location - Trends in Time and Place – Christopher Moravec

  • Spatial reference of location
    • There are different types of coordinate systems for location data
    • Examples: latitude, longitude, address (geocoding), zip code, city, county, state, country
  • Locating anything is about scale
    • Consider how the location is grouped/aggregated
    • eg. world cities, portland city limits, zip codes
    • select the location type depending on scale
    • create custom bins for analysis
    • zip codes are polygons & not always within city limits (eg. zip codes in Flint averaged the problems away)
  • Spatial Reference and Datum
    • “Earth is round and Maps are flat”
    • “Earth is round, sort of”
  • Calculating distance by treating latitude and longitude as flat x/y coordinates gives wrong results
  • Heat Maps (tells how dense stuff is, but can also hide data) vs Hot Spots (look at statistical trend)
  • Voter Registration Data with Hack Oregon
    • Hot spot analysis for the number of people on a grid over time
    • Looking from the map point of view from the start is the wrong way to go
    • Found that the average age of voters is getting older
    • Unable to find trends in space (probably because people are moving around a lot)
    • Number of voters in household is getting bigger (more young people staying at home?)
    • Mapping where young people go
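
To make the point about latitude/longitude distance concrete, here is a minimal sketch (not from the talk) using the haversine formula. It assumes a perfectly spherical Earth with a mean radius of 6371 km, which is itself only an approximation ("Earth is round, sort of"); the function names are my own.

```javascript
// Assumption: spherical Earth with mean radius 6371 km.
const EARTH_RADIUS_KM = 6371;

function toRadians(degrees) {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance between two (latitude, longitude) points in degrees.
function haversineKm(lat1, lon1, lat2, lon2) {
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
}

// One degree of longitude covers less ground the further you are from the equator.
const atEquator = haversineKm(0, 0, 0, 1);
const atPortland = haversineKm(45.52, -122.68, 45.52, -121.68);
console.log(atEquator.toFixed(1), "km per degree of longitude at the equator");
console.log(atPortland.toFixed(1), "km per degree of longitude near Portland");
```

The same one-degree difference in longitude is roughly 30% shorter at Portland's latitude than at the equator, which is why plain Euclidean distance on raw coordinates misleads.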

Using Web Archives to Enrich the Live Web Experience Through Storytelling

Automatically telling stories by referencing web archives

Using Web Archives to Enrich the Live Web Experience Through Storytelling – Yasmin AlNoamany

  • bit.ly/YasminPhD (slides)
  • Repositories disappearing over the internet
    • Resources are not persistent
    • Web archives such as the Wayback Machine and Archive-It (a subscription service from the Internet Archive) are services that archive the web
    • But understanding the content of web archives is not easy
  • Storify is a storytelling service
  • Web archives contain 2 dimensions = {URI, time} & produce 4 types of stories:
    • Fixed page, fixed time
    • Fixed page, sliding time
    • Sliding page, fixed time
    • Sliding page, sliding time
  • The Dark and Stormy Archives (DSA) framework
    • Establish baseline of social media stories
      • 28 mementos (quotes, pictures, videos, …) is a good number for the resources
    • Detecting off-topic pages
      • Database error, financial problems, hacking, domain expired
      • Duplicates
    • Selecting representative pages based on quality metrics and attractive snippets
    • Using Storify API to visualize stories
  • Stories are generated automatically and evaluated with Mechanical Turk to compare human-generated stories and the automatically generated ones
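
The {URI, time} idea can be sketched as a small classifier. This is my own illustration, not part of the DSA framework: the memento record shape (`uri`, `datetime`) and the one-day window used to decide that captures share a "fixed" time are both assumptions.

```javascript
// Hypothetical memento records: { uri, datetime }, with datetime as an ISO 8601 string.
// Assumption: captures that fall within `windowMs` of each other count as "fixed time".
function storyType(mementos, windowMs = 24 * 60 * 60 * 1000) {
  if (mementos.length === 0) {
    throw new Error("need at least one memento");
  }
  const uris = new Set(mementos.map(m => m.uri));
  const times = mementos.map(m => Date.parse(m.datetime));
  const span = Math.max(...times) - Math.min(...times);

  const page = uris.size === 1 ? "fixed page" : "sliding page";
  const time = span <= windowMs ? "fixed time" : "sliding time";
  return `${page}, ${time}`;
}

// One page captured years apart: the page is fixed, the time slides.
const example = [
  { uri: "http://example.com/", datetime: "2011-01-25T00:00:00Z" },
  { uri: "http://example.com/", datetime: "2013-07-03T00:00:00Z" },
];
console.log(storyType(example)); // "fixed page, sliding time"
```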

Keynote | Introducing d3.express, IDE for D3

Introducing d3.express IDE (not ready yet) for making coding easier and javascript generators

Mike Bostock

  • The purpose of visualization is insight, not pictures
  • “Programming is blindly manipulating symbols” – Bret Victor from ‘Drawing Dynamic Visualizations’
  • Symptoms of inhuman code: convoluted, mutable state, inscrutable
  • Code is often the best tool we have because it is the most general tool we have (code has unparalleled expressiveness: it combines primitives)

Part 1. Reactivity

  • Care only about the current state
  • Can inspect the array that is produced by clicking
d3 = require("d3")
d3.csv("/path/to/file.csv")

d3.csv("/path/to/file.csv", d => {
  return {
    date: d.date,
    close: +d.close // + to convert string to number
  }
})
  • Errors are no longer global
  • Order independent
// Use function
parseTime = d3.timeParse("%d-%b-%y") // e.g. parses "24-Apr-07"

d3.extent(data, d => d.date) // [min, max] of the dates; order independent

d3.csv("/path/to/file.csv", d => {
  return {
    date: parseTime(d.date),
    close: +d.close
  }
})
  • curly braces for more involved definition
{
  var svg = DOM.svg(width, height);
  return svg;
}
  • it gets complicated real quick & harder to follow
  • reactive programming lets the runtime handle it
  • can use other software, e.g. Vega-Lite, with d3.express
  • javascript generators (a pull system at runtime)
// javascript generators
// variable = *{ ... yield ... }
projection = *{
	var projection = d3.geoOrthographic();
	while (true) {
		yield projection.rotate([Date.now() / 200, -20]); // division assumed; it sets the rotation speed
	}
}
  • you can opt in to more complexity (to avoid paying the overhead cost of drawing, tossing, and redrawing)
  • viewof operator is a shorthand
  • .call(brush) to view data
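
For reference, here is how the "pull" behaviour of generators looks in plain JavaScript, outside d3.express. This is a sketch of my own: `rotations` is a made-up name, and it yields fixed steps rather than clock-based angles so the output is predictable.

```javascript
// A generator only runs when the consumer asks for the next value (a pull system).
// `rotations` is a hypothetical helper that yields [angle, tilt] pairs forever.
function* rotations(stepDegrees = 1, tilt = -20) {
  let angle = 0;
  while (true) {
    yield [angle, tilt]; // execution pauses here until next() is called again
    angle = (angle + stepDegrees) % 360;
  }
}

// The infinite loop is safe because we pull exactly five values.
const frames = rotations(90);
const firstFive = Array.from({ length: 5 }, () => frames.next().value);
console.log(firstFive);
```

In d3.express the runtime does the pulling, so a cell defined by a generator re-evaluates its dependents each time a new value is yielded.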

Part 2. Visibility

  • interactive programming improves our ability to scrutinize behaviour by modifying the code
  • geometry + intuition

Part 3. Reusability

  • Reusability enables you to write less code
  • Passive code sits in between ‘one-off code’ and an actively maintained library
  • You can rewire definitions while importing
import {chart, width, height} with {data, y} from "filesignature"

Part 4. Portability

  • The notebook is intended to run in the browser
  • Web-first environment lets your code run anywhere (no installation necessary)
  • Reference: “Explorable Explanations” by Bret Victor
  • Portable notebooks allow users to see how the code runs, and allow users to fork and modify the code for better understanding

Opinionated Analysis Development

Shifting away from the blame culture

Opinionated Analysis Development – Hilary Parker

  • Hilary’s Podcast: Not So Standard Deviations
  • Concepts: reproducible, accurate, collaborative
  • Developing the narrative of the analysis
    • What models are you going to use?
    • What arguments are you going to make?
    • How are you going to convince your audience of something?
  • Technical artifact
    • Tools you are going to use?
    • Technical and coding choice (tidyverse, R, python)
  • Need to define the process
  • Reference: The Field Guide to Understanding ‘Human Error’ by Sidney Dekker
    • Switch from “blaming” a person to blaming the process that they used
    • The person did not want to make an error
    • the current system failed the person with good intentions
    • Don’t play the blame game
    • “Blameless PostMortems and a Just Culture” - John Allspaw
    • Error occurs -> Blameless PostMortems -> Adopt a new process
    • Shift from “I know better than you” to “lots of people have run into this problem before, and here’s software that solves it”
  • How not to win friends and influence people:
    • “Well … Actually …”
    • Arguing over twitter
    • Throw tools at them

Applying software engineering to data analysis

Applying software engineering practices to data analysis – Emil Bay

  • Emil is a founder of CommodiTrader
  • Decoupling
    • Take a program (that may be messy or doesn’t scale well for human memory) and pull it apart so it’s easier to see what is going on
  • UNIX Philosophy
    • Write programs that do one thing and do it well
    • Write programs to work together
  • Clear inputs and outputs to link programs together
    • Contract through *.rda files (in R)
    • Checkpoint - “Save” computation (to intermediate files)
    • reproducible (from sharing intermediate results and having the same starting point)
    • Idempotent (should always get the same output for a given input)
  • Makefile
    • to solve dependencies & compose targets
    • automatic invalidation
    • proven technology
    • still state of the art
## Format of make target (the command line must start with a tab)
output: inputs
	command

## Example
rdata/crawl.rda: crawl.R
	R --vanilla --slave --file=crawl.R
  • Pure functions
    • can only depend on the input you give it
    • inputs need to be explicit; a “side effect” such as talking to a web service makes the output non-reproducible
    • Pure functions cannot have side effects or mutate state, and their output must come directly from their input
    • with pure functions, you are able to do testing
  • Defensive programming to make assertions (stopifnot, assertthat for R)
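
A small sketch of the last two ideas together (my own example, not from the talk, and in JavaScript rather than R): `assertThat` is a hypothetical stand-in for R's `stopifnot`, and the age calculation is made up purely for illustration.

```javascript
// Hypothetical helper playing the role of R's stopifnot.
function assertThat(condition, message) {
  if (!condition) {
    throw new Error(`Assertion failed: ${message}`);
  }
}

// Impure: reads the system clock, so the same input can give a different output next year.
function ageImpure(birthYear) {
  return new Date().getFullYear() - birthYear;
}

// Pure: every input is explicit, nothing outside the function is read or changed.
function age(birthYear, currentYear) {
  assertThat(Number.isInteger(birthYear), "birthYear must be an integer");
  assertThat(Number.isInteger(currentYear), "currentYear must be an integer");
  assertThat(birthYear <= currentYear, "birthYear must not be after currentYear");
  return currentYear - birthYear;
}

console.log(age(1990, 2017)); // 27, today and every other day
```

Because `age` depends only on its arguments, a test can pin down its result exactly, while `ageImpure` would need the clock to be faked first.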

Data Tables | Dat

Project Dat

  • Project Dat
  • “Dat is a grant-funded, open-source, decentralized data sharing tool for efficiently versioning and syncing changes to data.”
$ npm install -g dat
$ dat share path/to/my/data

Javascript for numeric computing

Math, Numeric Computing, and JavaScript – Athan Reines

Why JS?

  • Language of the Web
  • Numeric computing is not typically thought to be done with JavaScript
  • JS is dynamically compiled and faster than R and Python
  • Community of numerical computing in JS is small
  • Lack of JS libraries (such as NumPy in Python) for numerical analysis
  • You might want to leverage JS because:
    • Web APIs
    • Rendering
    • Visualization/computation
    • Ubiquity
    • Distribution (only need URL in browser)
    • Package management
  • Applications you can do
    • Edge computing (outsource computing to client’s machine)
    • Cross-platform
    • Compute Intensive
    • Interactive Data Analysis
    • Integrated machine learning in web browser context
    • AI powered

State of Math in JS is not that great

  • No standard algorithms
  • No min precision
  • portability
  • No common codebase
  • Slow pace of innovation
  • Bugs

State of Ecosystem is also not that great

  • False assumptions
  • Poor implementations (how do they compute variability?)
  • Inefficient scope
  • Lack of ambition
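
To illustrate what a "poor implementation" can look like, here is a sketch of my own (not from the talk, and not stdlib's code) that reads "variability" as sample variance. The textbook one-pass formula subtracts two huge, nearly equal numbers, while Welford's algorithm updates a running mean and avoids that cancellation.

```javascript
// Textbook one-pass formula: (sum of squares - sum^2 / n) / (n - 1).
// Prone to catastrophic cancellation when values are large relative to their spread.
function naiveVariance(xs) {
  const n = xs.length;
  let sum = 0;
  let sumSq = 0;
  for (const x of xs) {
    sum += x;
    sumSq += x * x;
  }
  return (sumSq - (sum * sum) / n) / (n - 1);
}

// Welford's algorithm: accumulate squared deviations from a running mean.
function welfordVariance(xs) {
  let n = 0;
  let mean = 0;
  let m2 = 0; // running sum of squared deviations from the mean
  for (const x of xs) {
    n += 1;
    const delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean);
  }
  return n < 2 ? NaN : m2 / (n - 1);
}

// Same spread as [4, 7, 13, 16] (sample variance 30), shifted by one billion.
const shifted = [4, 7, 13, 16].map(x => x + 1e9);
console.log("Welford:", welfordVariance(shifted)); // 30
console.log("Naive:  ", naiveVariance(shifted));   // can drift away from 30
```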

Standard library (stdlib)

  • A standard library for JavaScript and Node.js with an emphasis on numerical computing
  • Includes REPL help documentation
  • Still in active development; your help is needed

  • WebAssembly is an efficient low-level bytecode format for in-browser client-side scripting
  • Javascript is asynchronous