Today is the second (and final) day of the CSV Conference in Portland and the following are notes I’ve taken regarding the talks I’ve attended:
- Contexts & Institutions (Keynote)
- Emoji Data Science
- Trends in Data in Time and Space
- Using Web Archives to Enrich the Live Web Experience Through storytelling
- Introducing d3.express, IDE for D3 (Keynote)
- Opinionated Analysis Development
- Applying software engineering to data analysis
- Dat (Data tables)
- Javascript for numeric computing
Keynote | Contexts & Institutions
Saving and indexing research quality federal environmental and climate data.
Laurie Allen
Follow @librlaurie
- Laurie is a librarian involved with data rescue and is a “rogue scientist” to save climate data from Trump
- DataRefuge
- formed over the concerns about the disappearance of research quality (“real”) federal environmental and climate data. The risk is real.
- The goal is to save copies of federal data
- Would like to see events regarding designated communities (to identity data in communities that should be saved and to save it)
- How do you slice and dice federal information?
- There is no list
- You can organize by data format & size, agency that produced the data, covered by federal records laws, or most valuable & vulnerable first
- No perfect way to slice and dice federal information
- Types of vulnerabilities: legal, technical, political, use & uniqueness of the data
- Event paths
- Seed & sort webpages (from WayBack Machine)
- DataRefuge: harvest, check, bag, describe the data
- Storytelling (one way of saving data, is to understand the data)
- long trail (what to do in future to make it more sustainable)
- Codification: spreadsheets -> workflow -> web app -> system
- Still work to be done eg. description is not quite clear what data is for
- Shout out to Environmental Data & Governance Initiative (EDGI)
- Dat Project & notion of Frictionless data
- Multiple meaning of metadata
- Metadata, paradata (universe of things one need to know to work with data) and software preservation
- When data is given to another person, there are assumptions in place to what the other person knows
- “Metadata is a love letter to the future” – Rachel Appel who was quoting someone else
- Institutions have lots of barriers
- But we need to build communities within and across institutions
- Federal deposit repositories is for when federal priorities change (they always do)
- Miss Adelaide R Hasse Librarian, Office of Public Documents
- data ~ stories; data and stories are deeply connected
- Q&A
- U.S. produce massive amounts of data and is used by international collaborations/groups
- How to make the internet better at not destroying things? Hard problem. Sending HTML file is probably not the way to go.
Emoji Data Science
Telling stories from emoji data
Emoji data science & journalism – Hamdan Azhar
Follow @HamdanAzhar
#csvconf thanks for all the <3 during my #emojidatascience talk ❤️, i just posted the slides here! https://t.co/uK7bStg8w9
— Hamdan Azhar (@HamdanAzhar) May 3, 2017
- Finding good questions, Analyzing data, Cleaning data, …
- Ask questions you are passionate about
- PRISMOJI (data science lab in New York)
- Brexit
- “The world’s reaction to Brexit, in emoji”
- Names + Emoji
- “Emoji sentiment analysis”
- “Here Are the Most Popular Emojis From the #Brexit Reaction” (published on Motherboard)
- Question: How do people use emojis to react to events in real time?
- Analysis: Twitter API to get data -> then analysis in R
- Finding the top emojis associated with “hashtag signatures”
- “The world’s reaction to Brexit, in emoji”
- “A Data Scientist’s Emoji Guide to Kanye West and Taylor Swift” (published on Motherboard)
- Finding the top 5 emojis associated with each person
- “Order coming out of chaos” – you don’t teach people how to use emojis (unlike english in school)
- Emojis during election
- Hillary vs Trump
- Hillary before and after midnight
- Emoji over time
- Word clouds associated with emojis
- Finding interesting story is about looking at questions you care about
- data journalism & storytelling
- The Raised Fist Emoji Is Social Media’s Resistance Symbol
- See red heart emoji -> protest is not always from a place of hate
- Ideas for the future: Emoji co-occurrence
also check out the @PRISMOJI #emojidatascience #rstats tutorial i mentioned https://t.co/A4lrCrrqmj
— Hamdan Azhar (@HamdanAzhar) May 4, 2017
Trends in Data in Time and Space
Analyzing geolocation data
Location - Trends in Time and Place – Christopher Moravec
Follow @Mor3Havoc
- Spacial reference of location
- There are different types of coordinate systems for location data
- Examples: latitude, longitude, address (geocoding), zip code, city, county, state, country
- Locating anything is about scale
- Consider how the location is grouped/aggregated
- eg. world cities, portland city limits, zip codes
- select the locate type depending on scale
- create custom bins for analysis
- zip codes are polygons & not always within city limits (eg. zip codes in Flint averaged the problems away)
- Spatial Reference and Datum
- “Earth is round and Maps are flats”
- “Earth is round, sort of”
- Calculating distance in latitude and longitude is false
- Heat Maps (tells how dense stuff is, but can also hide data) vs Hot Spots (look at statistical trend)
- Voter Registration Data with Hack Oregon
- Hot spot analysis for the number of people on a grid over time
- Looking from the map point of view from the start is the wrong way to go
- Found that the average age of voters is getting older
- Unable to find trends in space (probably because people are moving around a lot)
- Number of voters in household is getting bigger (more young people staying at home?)
- Mapping where young people go
Using Web Archives to Enrich the Live Web Experience Through storytelling
Automatically telling stories by referencing web archives
Using Web Archives to Enrich the Live Web Experience Through Storytelling – Yasmin AlNoamany
Follow @yasmina_anwar
- bit.ly/YasminPhD (slides)
- Repositories disappearing over the internet
- Resources are not persistent
- Web archives such as the Wayback Machine and Archive-it (subscription service from Internet archive) are services that archive the web
- But understanding the content of web archives is not easy
- Storify is a story tellings service
- Web archives contain 2 dimensions = {URI, time} & produce 4 types of stories:
- Fixed page, fixed time
- Fixed page, sliding time
- Sliding page, fixed time
- Sliding page, sliding time
- The Dark and Stormy Archives (DSA) framework
- Establish baseline of social media stories
- 28 mementos (quotes, pictures, videos, …) is a good number for the resources
- Detecting off-topic pages
- Data base error, financial problems, hacking, domain expired
- Duplicates
- Selecting representative pages based on quality metrics and attractive snippets
- Using Storify API to visualize stories
- Establish baseline of social media stories
- Stories are generated automatically and evaluated with Mechanical Turk to compare human-generated stories and the automatically generated ones
Keynote | Introducing d3.express, IDE for D3
Introducing d3.express IDE (not ready yet) for making coding easier and javascript generators
Mike Bostock
Follow @mbostock
Introducing d3.express: the integrated discovery environment. https://t.co/WfwCA9DDoD
— Mike Bostock (@mbostock) April 28, 2017
- The purpose of visualization is insight, not pictures
- “Programming is blindly manipulating symbols” – Bret Victor from ‘Drawing Dynamic Visualization’
- Symptoms of inhuman code: convoluted, mutable state, inscrutable
- Code often best tool we have, because most general tool we have (code has un-parallel expressiveness; combination of primitives)
Part 1. Reactivity
- Care only about the current state
- Can inspect the array that is produced by clicking
d3 = require("d3")
d3.csv("/path/to/file.csv")
d3.csv("/path/to/file.csv", d => {
return {
date: d.date,
close: +d.close // + to convert string to number
}
})
- Error is no longer global errors
- Order independent
// Use function
parseTime = d3.timeParse("%d-%-%y")
d3.extend(data, d => d.date) //order independent
d3.csv("/path/to/file.csv", d => {
return {
date: parseTime(d.date),
close: +d.close
}
})
- curly braces for more involved definition
{
var svg = DOM.svg(width, height);
return svg;
}
- it gets complicated real quick & harder to follow
- reactive programming let runtime handle it
- can use other software eg.
vegalite
to use d3.express - javascript generators (is a pull system during runtime)
// javascript generators
// variable = *{ ... yield ... }
projection = *{
var projection = d3.geoOrthographic();
while (true) {
yield projection.rotate([Date.now() 200, -20]);
}
}
- you can opt in more complexity (to avoid paying overhead cost of drawing, tossing, and redrawing)
viewof
operator is a shorthand.call(bush)
to view data
Part2: Visibility
- interactive programming improves our ability to scrutinize behaviour by modifying the code
- geometry + intuition
Part3: Reusability
- Reusability enables you to write less code
- Passive code is the in between of ‘one-off code’ and an actively maintained library
- You can rewire definitions while importing
import {chart, width, height} with {data, y} from "filesignature"
Part4: Portability
- The notebook is intended to run in the browser
- Web-first environment lets your code run anywhere (no installation necessary)
- Reference: “Explorable Explanation” by Bret Victor
- Portable notebooks allow users to see how the code runs, and allow users to fork and modify the code for better understanding
Opinionated Analysis Development
Shifting away from the blame culture
Opinionated Analysis Development – Hilary Parker
Follow @hspter
Grateful I got to share today at #csvconf the impact that @allspaw had on me during my years at @Etsy https://t.co/4YbIrqqfxU
— Hilary Parker (@hspter) May 3, 2017
- Hilary’s Podcast: Not So Standard Deviations
- Concepts: reproducible, accurate, collaborative
- Developing the narrative of the analysis
- What models are you going to use?
- What arguments are you going to make?
- How are you going to convince your audience of something?
- Technical artifact
- Tools you are going to use?
- Technical and coding choice (tidyverse, R, python)
- Need to define the process
- Reference: The Field Guide to Understanding ‘Human Error’ by Sidney Dekker
- Switch from “blaming” a person to blaming the process that they used
- The person did not want to make an error
- the current system failed the person with good intentions
- Don’t play the blame game
- “Blameless PostMortems and a Just Culture” - John Allspaw
- Error occurs -> Blameless PostMortems -> Adopt a new process
- Shift from “I know better than you”, to there are “lots of people who have run into this problem before, and here’s the software for solution”
- How not to win friends and influence people:
- “Well … Actually …”
- Arguing over twitter
- Throw tools at them
Applying software engineering to data analysis
Applying software engineering practises to data analysis – Emil Bay
Follow @emilbayes
- Emil is a founder of CommodiTrader
- Decoupling
- Take program (that may be messy or doesn’t scale well for human memory) and pull it a part so it’s easier to see what is going on
- UNIX Philosophy
- Write programs that do one thing and do it well
- Write programs to work together
- Clear inputs and outputs to link programs together
- Contract through *rda (in R)
- Checkpoint - “Save” computation (to intermediate files)
- reproducible (from sharing intermediate results and having the same starting point)
- Idempotent (should always get the same output for a given an input)
- Makefile
- to solve dependencies & compose targets
- automatic invalidation
- proven technology
- still state of the art
## Format of make target
output: inputs
command
## Example
rdata/crawl.rda: crawl.R
R --vanilla --slave --file=crawl.R
- Pure functions
- can only depend on the input you give it
- inputs need to be explicit, but “side-effect” talks to web service (output not reproducible)
- Pure functions cannot have side effects, state mutation, and output must be directly from input
- with pure functions, you are able to do testing
- Defensive programming to make assertions (
stopifnot
,assertthat
for R)
Data Tables | Dat
Project Dat
Follow @dat_project
- Project Dat
- “Dat is a grant-funded, open-source, decentralized data sharing tool for efficiently versioning and syncing changes to data.”
$ npm install -g dat
$ dat share path/to/my/data
Javascript for numeric computing
Math, Numeric Computing, and JavaScript – Athan Reines
Follow @kgryte
Thank you to everyone at #csvconf who came to my talk about #math & #javascript! Here are my slides! https://t.co/8JeRXgd1lx @CSVConference
— Athan Reines (@kgryte) May 4, 2017
Why JS?
- Language of the Web
- Numeric computing is not typically thought to be done with Javascript
- JS is dynamically compiled and faster than R and python
- Community of numerical computing in JS is small
- Lack of JS Libraries (such as numpy in Python) for numerical analysis
- You might want to leverage JS because:
- Web APIs
- Rendering
- Visualization/computation
- Ubiquity
- Distribution (only need URL in browser)
- Package management
- Applications you can do
- Edge computing (outsource computing to client’s machine)
- Cross-platform
- Compute Intensive
- Interactive Data Analysis
- Integrated machine learning in web browser context
- AI powered
State of Math in JS is not that great
- No standard algorithms
- No min precision
- portability
- No common codebase
- Slow pace of innovation
- Bugs
State of Ecosystem is also not that great
- False assumptions
- Poor implementations (how do they compute variability?)
- Inefficient scope
- Lack of ambition
Standard library (stdlib)
- A standard library for JavaScript and Node.js with emphasis of numerical computing
- yes REPL help document
-
Still in active development; your help is needed
- “Web Assemblies” is a is an efficient low-level bytecode format for in-browser client-side scripting
- Javascript is asynchronous