Day67: csv,conf,pdx

Posted by csiu on May 2, 2017 | with: 100daysofcode
  by Celia Siu -- a totally serious conversation about cats (in Portland, csv,conf,v3)

Today is the first day of the CSV Conference in Portland, and the following are my notes from the talks I attended:

  • Smelly London Project
  • Designing data exploration
  • Better Bar graphics
  • U.S. Open Data Policy (Keynote)
  • Data and abuse of power
  • Scratching Someone Else’s Itch
  • Misleading people with data
  • Continuous Data Validation for everyone
  • data.world (Data tables)
  • Corporate Data Science (Keynote)


Smelly London Project

Data/text mining of Medical Officer of Health (MOH) records of smells to interpret the health of 19th and 20th century Londoners

“Smelly London: visualising historical smells through text-mining, geo-referencing and mapping” – Deborah Leem

  • www.londonsmells.co.uk (project page)
  • Produced dataset of smell-related words
    • From 19th and 20th century London
    • records written by the Medical Officer of Health
  • Created smell categories
    • Used a smell ontology/dictionary for categorization (a minimal sketch follows this list)
  • Implied smells and diseases
    • There is a correlation between smells and diseases
    • Future work: use machine learning to infer the implied smells
  • Visualization of smells on the map of London
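
To give a flavour of the dictionary-based categorization step, here is a minimal Python sketch (my own illustration, not the project's code; the categories, trigger words, and record below are made up):

```python
import re

# Hypothetical smell ontology: category -> trigger words
# (the real project used a curated smell dictionary; these terms are made up)
SMELL_ONTOLOGY = {
    "industrial": {"tar", "gas", "smoke", "chemical"},
    "organic":    {"sewage", "manure", "refuse", "slaughter"},
}

def categorize_smells(record):
    """Return the smell categories whose trigger words appear in a record."""
    words = set(re.findall(r"[a-z]+", record.lower()))
    return {category
            for category, triggers in SMELL_ONTOLOGY.items()
            if words & triggers}

# A made-up MOH-style record for illustration
record = "Complaints of sewage and tar smoke near the works."
print(categorize_smells(record))  # e.g. {'industrial', 'organic'}
```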

Designing data exploration

Design by exploring the data first, then producing high-level analyses (after asking a few key questions).

Designing data exploration: How to make large data sets accessible (and fun to use) – Simon Jockers

  • People are sitting on large datasets and there is a need for computer-assisted reporting
    • See German National Archive (Heimatkunde) for an instance of publishing a dataset instead of a story
  • ProPublica is journalism in the public interest
  • Start by exploring the data (use statistical packages, browse, and look at the data from different angles), then produce a high-level analysis of the data (a minimal sketch follows this list)
  • Build tools (not graphics)
  • Things to consider:
    • What is the data? Tables? Graphs? Documents/text?
    • Should this really be public? Is it in the public’s interest? Does it contain sensitive information? What problems might arise?
    • Who are the users? What are their goals (interest in the topic)? Data literacy? Media consumption habits (eg. casual & on the go with lower attention; or high attention in office setting)?
    • How should the data be displayed? Macro far view (to get the big picture/high-level overviews) or Micro near view (to look at individual data points; anecdotes or stories)
    • Personalization
    • Embedding by integrating visualization with narrative
    • How to publish data & make it interesting?
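
As a trivial sketch of that “explore first” step (my own illustration, not from the talk; the file name and columns are hypothetical), a first pass with pandas might look like:

```python
import pandas as pd

# Load a hypothetical dataset (file name and columns are made up)
df = pd.read_csv("archive_records.csv")

# Browse: size, column types, and a few raw rows
print(df.shape)
print(df.dtypes)
print(df.head())

# Summary statistics for the numeric columns
print(df.describe())

# A different angle on the same data: counts per category
print(df["region"].value_counts().head(10))
```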

Better Bar graphics

Data visualization has been used in different ways over time, and changing a visualization’s aesthetics can influence how the data is perceived.

The Journey to a better bar graph, and beyond – Daniel Orbach

  • Small decisions make large visual impact
  • Visualization as documentation
    • Christoph Scheiner 1630 AD (documented sunspots)
    • using “small multiples” (see the sketch after this list)
  • Visualization as an opinion to express an argument
    • William Playfair 1822 AD (price of wheat)
  • Visualization for answers from big data
    • John Snow 1855 AD (traced the source of cholera to a single contaminated water pump)
  • Data visualization evolves alongside information, complexity, and technology
  • By changing the shape, colour, & rhythms of a data visualization, the visualization influences how the data is perceived
  • Data visualization is only going to get harder
    • eg. Dataviz, meet machine learning (such as TensorBoard)
  • Seeing critically is about seeing what is there and what is not there
  • Questions to ask when designing/consuming data visualizations
    • Should this be visualized?
      • Fewer than 20 points? Maybe use a table
      • Way too complicated (too many variables & you don’t know what you are trying to show)
    • What assumptions am I making?
      • To some people, red=bad and green=good; maybe use neutral colours such as yellow and blue instead
    • What am I trying to depict?
      • Among items, over time, distribution, comparisons
  • Orbach uses Logger to document train rides
  • Orbach suggested Sketch as an alternative to Adobe’s design tools
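
As an aside, the “small multiples” idea is easy to sketch in Python with matplotlib (my own illustration, not Orbach's; the regions and counts are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

# Fake monthly counts for four regions (made-up data for illustration)
np.random.seed(0)
months = np.arange(12)
regions = {name: np.random.poisson(lam, size=12)
           for name, lam in [("North", 5), ("South", 8), ("East", 3), ("West", 6)]}

# Small multiples: one small panel per region, shared axes for fair comparison
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(6, 4))
for ax, (name, values) in zip(axes.flat, regions.items()):
    ax.bar(months, values, color="steelblue")  # a neutral colour, per the talk
    ax.set_title(name)
fig.tight_layout()
plt.show()
```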

Keynote | U.S. Open Data Policy

Advocating for open data

Heather Joseph

  • SPARC | www.sparcopen.org (an organization whose goal is to make open the default in research and education)
  • There are a number of U.S. Information Policy Precedents, including OMB Circular A-130, “Management of Federal information resources.” (1996)
  • The “Open Government Directive”
    • Set expectations from day 1 of the Obama administration
    • This directive is to make information available in open format
    • see the Wayback Machine (Internet Archive) snapshots of data.gov
  • We need talent & visibility in leadership roles
    • DJ Patil (the first US chief data scientist)
    • The DATA Act (Digital Accountability and Transparency Act)
  • Now we have to work through: Fake news, alternative facts, 404 errors (from pages being removed from websites), & gag orders
  • What we can do
    1. Play active, pragmatic defence
      • Data rescue events/ DataRefuge
    2. Recognize that progress is still possible
    3. Preserve data and keep services active
      • We have to understand what the new priorities are for this administration, and that open transport is out
  • American taxpayers are entitled to the research they pay for
  • Position open data as an enabling force
    • eg. Making financial report data open and machine readable to make it easier for potential investors to create jobs
  • “PDF is not a searchable format”
  • Individuals make a difference
    • Speak out, step up, take action

Data and abuse of power

Types of data biases

Data & Abuse of Power – Moiz Syed

  • Trial and Terror – The Intercept
  • Telling a false/biased story
  • Spotting fake news
  • Bias:
    • We don’t collect it (eg. police shooting data)
    • We don’t collect it and you can’t either (eg. cut funding for gun violence research; NFL brain injury and mental disability data)
    • We collect it but you can’t have it (eg. removal of open White House data)
    • We frame what we share
      • how data is being aggregated
      • how the data is sampled
      • how the data is labeled
    • The data is the bias

Scratching Someone Else’s Itch

An itch is a problem and a scratch is the process of solving the problem.

Scratching Someone Else’s Itch – Adam Hyde

  • adamhyde.net (personal website)
  • Coko | https://coko.foundation for solving scholarly publishing problems with open source
  • Open source is good for infrastructure and developer tools, but not for user-facing solutions
  • Why has open source failed?
    • Open source is not good at scratching someone else’s itch
    • the developer is the code specialist, but the user is the use-case specialist
  • Design First & Design with the User by sitting down with the user
  • The Cabbage Tree Method | https://www.cabbagetree.org
    • Good for: Building platforms, working with organizations, fixing workflows
    • Facilitated Design
    • Use case specialists design their own products
    • Iterative Design -> Build Sessions
  • Example: Editoria

Misleading people with data

Analyses and visualizations of raw data can be misleading.

How to mislead the public – Philipp Burckhardt

  • Smaller schools do better on exams
    • The standard error of the mean (σ/√n) is larger for smaller schools, so small schools show more extreme averages
    • called “the most dangerous equation” by Howard Wainer
    • possible solution: standardize variables by subtracting the overall mean and dividing by the standard deviation to make them comparable
  • Simpson’s paradox: Gender Discrimination at UC Berkeley? (a toy example follows these notes)
    • Split by department -> admission rates by gender are roughly equal
    • Aggregated across all departments -> more males than females were admitted
  • Praise or Rebuke?
    • Praise if score >5; but those students tend to do worse the next time
    • Rebuke if score <4; but those students tend to do better the next time
    • Regression to the mean
      • If the mean is 5, an extreme score is partly chance, so the next score tends to move back toward the mean
  • Evolutionary Psychology: Work from Satoshi Kanazawa
    • “Beautiful parents have more daughters (…)”
      • Study rated people’s attractiveness (Score 5 for very attractive)
      • Type 1 error (false positive)
      • Why compare group 5 to the rest?
      • Should be using multiple test correction
    • “Engineers have more sons, nurses have more daughters”
    • “Why you can’t get a date on a Saturday night and why most suicide bombers are Muslim”

    To draw the rabbit out of the hat you always have to have put it in beforehand. – Jacques Lacan

  • Advice:
    • Be careful of observational studies
    • Things should be reproducible
    • Use randomized controlled experiments
    • Talk to subject experts
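
The Berkeley example is easy to reproduce with toy numbers. Here is a minimal Python sketch (the counts are made up to show the effect; they are not the actual admissions data):

```python
# (applicants, admitted) per department and gender -- made-up numbers
admissions = {
    "Dept A": {"men": (80, 60), "women": (20, 16)},   # "easy" department
    "Dept B": {"men": (20, 4),  "women": (80, 20)},   # "hard" department
}

def rate(applied, admitted):
    return admitted / applied

# Split by department: women are admitted at a HIGHER rate in both
for dept, groups in admissions.items():
    print(dept, {g: f"{rate(*counts):.0%}" for g, counts in groups.items()})
# Dept A {'men': '75%', 'women': '80%'}
# Dept B {'men': '20%', 'women': '25%'}

# Aggregated: men are admitted at a higher rate overall,
# because women mostly applied to the harder department
for g in ("men", "women"):
    applied = sum(admissions[d][g][0] for d in admissions)
    admitted = sum(admissions[d][g][1] for d in admissions)
    print(g, f"{rate(applied, admitted):.0%}")
# men 64%
# women 36%
```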

Continuous Data Validation for everyone

Continuous Data Validation for Everybody – Adrià Mercader

  • CKAN - the open source data portal software
  • Open Knowledge International - a global non-profit network that promotes and shares information at no charge, including both content and data
  • Frictionless data package: standards & tooling
    • measurable improvement in how data is shared/consumed/analyzed
    • make it easier to maintain and improve data quality
  • Tabular Data Package
    • Data Package + Table Schema + CSV (datapackage.json + CSV)
  • Validate tabular data by (see the sketch after this list):
    • structure (eg. are there duplicate rows? missing headers?)
    • comparing the data file against a schema (eg. is this field an integer?)
  • goodtables.io brings continuous data validation to everyone
    • Becomes a GitHub check
    • goodtables.yml defines the source, schema, delimiter, skip_rows
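
To make the two validation layers concrete, here is a minimal pure-Python sketch of the idea (my own illustration, not goodtables’ implementation; the schema and file name are made up):

```python
import csv

# A made-up Table-Schema-style description of the expected fields
SCHEMA = {"id": int, "name": str, "score": float}

def validate_csv(path, schema):
    """Collect structural and schema errors for a CSV file."""
    errors, seen_rows = [], set()
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)

        # Structural checks: missing or duplicated headers
        if header is None:
            return ["empty file"]
        if len(set(header)) != len(header):
            errors.append("duplicate header")
        missing = set(schema) - set(header)
        if missing:
            errors.append(f"missing headers: {sorted(missing)}")

        for lineno, row in enumerate(reader, start=2):
            # Structural check: duplicate rows
            if tuple(row) in seen_rows:
                errors.append(f"row {lineno}: duplicate row")
            seen_rows.add(tuple(row))

            # Schema check: can each value be cast to its declared type?
            for name, value in zip(header, row):
                cast = schema.get(name)
                if cast is not None:
                    try:
                        cast(value)
                    except ValueError:
                        errors.append(f"row {lineno}: {name}={value!r} is not {cast.__name__}")
    return errors

print(validate_csv("data.csv", SCHEMA))  # [] when the file is valid
```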

Data Tables | data.world

Keynote | Corporate Data Science

Building successful data science teams

Angela Bassa

  • Corporate vs Indie Data science
    • Corporate – it’s not personal, it’s business; you are working for the man
  • “Data is useless without context” @thegrugq (security engineer)
    • Context is everything
    • Once you see the cow, you can’t unsee it
  • Agenda
    • Data Science in Industry
    • Building effective data science teams
    • Proper care and maintenance of data teams

Data Science in Industry

  • Some things don’t change:
    • data collection, wrangling, deploying, editing, modifying, analyzing, bugfixing, …
    • Understand, Collect, Explore/Visualize, Clean/Transform, Model, Validate, Communicate/Deploy
  • Bespoke solutions <-> introduction of production software
  • Some things change quite a bit
  • Blended objectives
    • Novel R&D - secret stuff
    • Legacy Production - money maker
    • Production stuff should never fail
    • When working on production scale, you should not break it
    • Legacy projects can be pushed into production (with continuous validation) and can turn into novel work

Building effective data science teams

  • Drew Conway (@drewconway): Data Science 3-way venn diagram with Hacking + Math/Stats + Substantive Expert
    • aka Unicorn … but Unicorns do not exist
    • “Your intuition will only be accurate if your cumulative experience is a representative sample of reality” @adampiore
    • There are different beasts (horse, llama, zebras, …) that offer different gifts
  • Super Chickens (from xkcd): the best egg-laying chickens; but when you put them together, within 6 generations they kill each other
    • Dynasties and intellectual inbreeding
    • Should not build a team of “super chickens” – every single player should bring something different
    • Mirror-tocracies (Kara Swisher) & Survivorship Bias
  • Building a large reality
    • Hire for what the team needs (What are the gaps? Lack of overlaps?)
    • Imposter syndrome
  • There’s more than one way to skin a cat
    • Psychological safety and lower anxiety
    • Mistakes might have been made
  • Common Blind Spots:
    • Software engineers - sampling
    • Deep learning folks - rigorous statistics (they typically come from a computer science background and lack rigorous mathematical training, eg. analysis & topology)
    • PhDs - ship-it-ness (because you care about truth; but for the magic to take place, the product has to leave the bench and be marketed)
    • Statisticians - unit tests (“by the logic of this equation it’s obvious; why do we need unit tests?”)
    • Managers - P-hacking (managers keep the team running, but to make money “the answer needs to be this”)
  • Multiple-Objective Decision Analysis (perspectives, seniority, backgrounds, languages, disciplines)
  • How to interview for best intersections:
    • How self-aware are you? What did you learn? Can you tell me about it?
    • Test communication skills

Proper care and maintenance of data teams

  • Data Team responsibilities:
    • Data Quality, Collection, Engineering, Security, Ethics, Communication, ROI
  • “Any sufficiently advanced negligence is indistinguishable from malice.” Deb Chachra (@debcha)
    • It’s not that no one cares; it’s that non-data folks are unprepared to care
    • Not enough information
    • When should you start caring? Do it now
  • What you expect is different from the current state of things
    • what exists is a system optimized for … not-analysis
    • But this is good, because the company has been optimized for something else
  • It is really easy to care, by leaving breadcrumbs
    • Documentation is how you “program” a business
      • Motivation
      • Reason to believe hypothesis
      • Dead ends
      • Pause projects
  • Figure out “What does the business do? Really?”
    • McDonald’s is really a real estate company
  • Having Junior Team Members is a great forcing function!
    • They are not jaded and are willing to ask questions that senior members won’t ask

Q&A

  • Different companies have different ways of valuing data assets
  • Documentation is to business as code is to product
    • Documentation is not yet generalizable
    • Atlassian Confluence, though a mediocre solution, provides centralized indexing + plugins