Today is the first day of the CSV Conference in Portland and the following are notes I’ve taken regarding the talks I’ve attended:
- Smelly London Project
- Designing data exploration
- Better Bar graphics
- U.S. Open Data Policy (Keynote)
- Data and abuse of power
- Scratching Someone Else’s Itch
- Misleading people with data
- Continuous Data Validation for everyone
- data.world (Data tables)
- Corporate Data Science (Keynote)
Smelly London Project
Data/text mining of Medical Officer of Health (MOH) reports for mentions of smells, to interpret the health of 19th and 20th century Londoners
“Smelly London: visualising historical smells through text-mining, geo-referencing and mapping” – Deborah Leem
Follow @1208DL
- www.londonsmells.co.uk (project page)
- Produced dataset of smell-related words
- From 19th and 20th century London
- reports written by the Medical Officer of Health
- Created smell categories
- Used a smell ontology/dictionary for categorization (see the sketch after this list)
- Implied smells and diseases
- There is a correlation between smells and diseases
- Future work to use machine learning to get the implied smells
- Visualization of smells on the map of London
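The notes don’t capture the project’s actual pipeline, but the dictionary-based categorization can be sketched in a few lines of Python. The categories and keywords below are hypothetical placeholders, not the project’s real smell ontology:

```python
import re
from collections import Counter

# Hypothetical smell dictionary; the real project uses a curated smell
# ontology built from the MOH reports, not this toy list.
SMELL_CATEGORIES = {
    "industrial": ["smoke", "tar", "gas works"],
    "sanitary": ["sewage", "cesspool", "drain"],
    "animal": ["manure", "slaughter", "stable"],
}

def tag_smells(text):
    """Count smell-category keyword hits in a single MOH report."""
    text = text.lower()
    counts = Counter()
    for category, keywords in SMELL_CATEGORIES.items():
        for kw in keywords:
            counts[category] += len(re.findall(r"\b" + re.escape(kw), text))
    return counts

report = "Complaints of sewage and open drains near the gas works ..."
print(tag_smells(report))
# Counter({'sanitary': 2, 'industrial': 1, 'animal': 0})
```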
Designing data exploration
Design by exploring the data, asking a few key questions, and then producing high-level analyses.
Designing data exploration: How to make large data sets accessible (and fun to use) – Simon Jockers
Follow @sjockers
Slides to go with my #csvconf talk: https://t.co/T2Sybf2g20
— Simon Jockers (@sjockers) May 2, 2017
- People are sitting on large datasets and there is a need for computer assisted reporting
- See German National Archive (Heimatkunde) for an instance of publishing a dataset instead of a story
- ProPublica is journalism in the public interest
- Start by exploring the data (use statistical packages, browse, and look at the data in different ways), then produce a high-level analysis of the data
- Build tools (not graphics)
- Things to consider:
- What is the data? Tables? Graphs? Documents/text?
- Should this really be public? Is it in the public’s interest? Does it contain sensitive information? What problems might arise?
- Who are the users? What are their goals (interest in the topic)? Data literacy? Media consumption habits (e.g. casual and on the go with lower attention, or high attention in an office setting)?
- How should the data be displayed? A macro (far) view to get the big picture/high-level overview, or a micro (near) view to look at individual data points, anecdotes, or stories?
- Personalization
- Embedding by integrating visualization with narrative
- How to publish data & make it interesting?
Better Bar graphics
Data visualization has been used in many different ways, and changing its aesthetics influences how the data is perceived.
The Journey to a Better Bar Graph, and Beyond – Daniel Orbach
Follow @Orbachdl
- Small decisions make large visual impact
- Visualization as documentation
- Christoph Scheiner 1630 AD (documenting sunspots)
- using “small multiples” (see the sketch after this list)
- Visualization as an opinion to express an argument
- William Playfair 1822 AD (price of wheat)
- Visualization for answers from big data
- John Snow 1855 AD (traced the source of cholera to a single contaminated water pump)
- Data visualization evolves alongside information, complexity, and technology
- By changing the shape, colour, & rhythms of a data visualization, the visualization influences how the data is perceived
- Data visualization is only going to get harder
- e.g. “Dataviz, meet machine learning” (such as TensorBoard)
- Seeing critically is about seeing what is there and what is not there
- Questions to ask when designing/consuming data visualizations
- Should this be visualized?
- Fewer than 20 points? Maybe use a table
- Way too complicated (too many variables and you don’t know what you are trying to show)?
- What assumptions am I making?
- To some people, red=bad and green=good; maybe use neutral colours such as yellow and blue instead
- What am I trying to depict?
- Among items, over time, distribution, comparisons
- Orbach uses Logger to document train rides
- Orbach suggested Sketch as an alternative to Adobe editors
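A minimal matplotlib sketch of the “small multiples” idea mentioned above: many small, identically scaled panels that invite comparison. The series plotted here are random placeholders, purely to show the layout:

```python
# Minimal "small multiples" sketch: one small panel per group,
# all sharing the same axes so the panels are directly comparable.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
groups = ["A", "B", "C", "D", "E", "F"]

fig, axes = plt.subplots(2, 3, figsize=(8, 5), sharex=True, sharey=True)
for ax, group in zip(axes.flat, groups):
    y = rng.normal(size=30).cumsum()   # placeholder series per group
    ax.plot(np.arange(30), y)
    ax.set_title(group, fontsize=9)

fig.suptitle("Small multiples: same scales, easy comparison")
fig.tight_layout()
plt.show()
```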
Keynote | U.S. Open Data Policy
Advocating for open data
Heather Joseph
Follow @hjoseph
- SPARC | www.sparcopen.org (organization whose goal is to make open the default in research and education)
- There are a number of U.S. Information Policy Precedents, including OMB Circular A-130, “Management of Federal information resources.” (1996)
- The “Open Government Directive”
- Set expectations from day 1 of the Obama administration
- This directive is to make information available in open format
- see the Wayback Machine (Internet Archive) captures of data.gov
- We need talent & visibility in leadership roles
- DJ Patil (the first US chief data scientist)
- DATA Act (Digital Accountability and Transparency Act)
- Now we have to work through: Fake news, alternative facts, 404 errors (from pages being removed from websites), & gag orders
- What we can do
- Play active, pragmatic defence
- Data rescue events/ DataRefuge
- Recognize that progress is still possible
- Preserving data and keeping services active
- We have to understand what the new priorities are for this administration and that open transport is out
- American Taxpayers are entitled to the Research they pay for
- Position open data as an enabling force
- eg. Making financial report data open and machine readable to make it easier for potential investors to create jobs
- “PDF is not a searchable format”
- Individuals make a difference
- Speak out, step up, take action
Data and abuse of power
Types of data biases
Data & Abuse of Power – Moiz Syed
Follow @MoizSyed
- Trial and Terror (The Intercept)
- Telling a false/biased story
- Spotting fake news
- Bias:
- We don’t collect it (e.g. police shooting data)
- We don’t collect it and you can’t either (e.g. cutting funding for gun-violence research; NFL brain injuries and mental disabilities)
- We collect it but you can’t have it (e.g. removing open White House data)
- We frame what we share
- how data is being aggregated
- how the data is sampled
- how the data is labeled
- The data is the bias
Scratching Someone Else’s Itch
An itch is a problem and a scratch is the process of solving the problem.
Scratching Someone Else’s Itch – Adam Hyde
Follow @CokoFoundation
Adams slides from #csvconf https://t.co/jlq7KLXE6o
— Coko Foundation (@CokoFoundation) May 3, 2017
- adamhyde.net (personal website)
- Coko | https://coko.foundation for solving scholarly publishing problems with open source
- Open source is good for infrastructure and developer tools, but not for user-facing solutions
- Why has open source failed?
- Open source is not good at scratching someone else’s itch
- the developer is the code specialist; but the user is the use case specialist
- Design First & Design with the User by sitting down with the user
- The Cabbage Tree Method | https://www.cabbagetree.org
- Good for: Building platforms, working with organizations, fixing workflows
- Facilitated Design
- Use case specialists design their own products
- Iterative Design -> Build Sessions
- Example: Editoria
Misleading people with data
Analyses and visualizations of raw data can be misleading.
How to mislead the public – Philipp Burckhardt
Follow @burckhap
- Smaller schools do better on exams
- Looking at the standard error of the mean: variability is higher for smaller schools (see the simulation sketch after this list)
- called the “most dangerous equation” by Howard Wainer
- possible solution: standardize variables by subtracting the overall mean and dividing by the standard deviation to make them comparable
- Simpson’s paradox: Gender Discrimination at UC Berkeley? (see the aggregation sketch after this list)
- If split by department, admission rates by gender are roughly equal
- If aggregated over all departments, more males than females were admitted
- Praise or Rebuke?
- Praise if score >5; but students tend to do worse next time
- Rebuke if score <4; but students tend to do better next time
- Regression to the mean (see the simulation after this list)
- If the mean is 5, then by random chance the next score will be better/worse
- Evolutionary Psychology: Work from Satoshi Kanazawa
- “Beautiful parents have more daughters (…)”
- Study rated people’s attractiveness (Score 5 for very attractive)
- Type 1 error (false positive)
- Why compare group 5 to the rest?
- Should be using multiple test correction
- “Engineers have more sons, nurses have more daughters”
- “Why you can’t get a date on a Saturday night and why most suicide bombers are Muslim”
To draw the rabbit out of the hat, you always have to have put it in beforehand. – Jacques Lacan
- “Beautiful parents have more daughters (…)”
- Advice:
- Be careful of observational studies
- Things should be reproducible
- Use randomized controlled experiments
- Talk to subject experts
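The notes above don’t include the formula, but the point of the “most dangerous equation” is that the standard error of a mean is sigma/sqrt(n), so small schools swing to the extremes. A rough Python simulation (the enrolments and scores below are made up for illustration):

```python
# Simulate exam means for many schools drawn from the SAME underlying
# distribution: small schools land in the top (and bottom) ranks far more
# often, purely because sigma / sqrt(n) is larger when n is small.
import numpy as np

rng = np.random.default_rng(42)
n_schools = 2000
sizes = rng.integers(20, 2000, size=n_schools)   # hypothetical enrolments
means = np.array([rng.normal(500, 100, size=n).mean() for n in sizes])

top = np.argsort(means)[-50:]                    # 50 highest-scoring schools
print("median size, all schools :", int(np.median(sizes)))
print("median size, top schools :", int(np.median(sizes[top])))
# The top schools are much smaller on average, even though every school
# draws its students from the same distribution.
```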
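A minimal sketch of the Simpson’s paradox point, with made-up admission counts rather than the real UC Berkeley data: each department admits women at an equal or better rate, yet the aggregate rate looks worse for women because they mostly apply to the harder department.

```python
# Illustrative (made-up) admissions counts: (applicants, admitted).
# Dept A is easy to get into, Dept B is hard; women mostly apply to B.
data = {
    "A": {"men": (800, 480), "women": (100, 65)},   # 60% vs 65%
    "B": {"men": (200, 40),  "women": (900, 190)},  # 20% vs ~21%
}

def rate(applicants, admitted):
    return admitted / applicants

for dept, groups in data.items():
    for sex, (apps, adm) in groups.items():
        print(f"Dept {dept} {sex:>5}: {rate(apps, adm):.0%}")

# Aggregate over departments: the direction flips.
for sex in ("men", "women"):
    apps = sum(data[d][sex][0] for d in data)
    adm = sum(data[d][sex][1] for d in data)
    print(f"Overall {sex:>5}: {rate(apps, adm):.0%}")
```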
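And a quick regression-to-the-mean simulation: if scores are pure noise around a mean of 5, high scorers tend to drop and low scorers tend to improve on the next try, with no causal effect of praise or rebuke. The thresholds follow the >5 / <4 example above; everything else is illustrative.

```python
# Regression to the mean: with scores that are just noise around a mean of 5,
# people who scored high tend to score lower next time (and vice versa),
# with no causal effect of praise or rebuke at all.
import numpy as np

rng = np.random.default_rng(7)
first = rng.normal(5, 1.5, size=100_000)
second = rng.normal(5, 1.5, size=100_000)   # independent of the first score

praised = first > 5          # would have been praised
rebuked = first < 4          # would have been rebuked
print("after praise, mean change:", (second[praised] - first[praised]).mean())  # negative
print("after rebuke, mean change:", (second[rebuked] - first[rebuked]).mean())  # positive
```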
Continuous Data Validation for everyone
Continuous Data Validation for Everybody – Adrià Mercader
Follow @amercader
Slides for my Continuous Data Validation talk https://t.co/mNiM7PgMl7 & the service can be tested at https://t.co/aVmzqrFzB6 #csvconf
— Adrià Mercader (@amercader) May 2, 2017
- CKAN - the open-source data portal software
- Open Knowledge International - a global non-profit network that promotes and shares information at no charge, including both content and data
- Frictionless data package: standards & tooling
- measurable improvement in how data is shared/consumed/analyzed
- make it easier to maintain and improve data quality
- Tabular Data Package
- Data Package + Table Schema + CSV (datapackage.json + CSV)
- Validate tabular data by (see the validation sketch after this list):
- structure (e.g. are there duplicate rows? missing headers?)
- comparing the data file against the schema (e.g. is this field an integer?)
- goodtables.io brings continuous data validation to everyone
- Becomes a GitHub check
- goodtables.yml defines source, schema, delimiter, skip_rows
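goodtables automates exactly these structural and schema checks. As a rough illustration of what they involve (this is not goodtables’ implementation or API; the three-field schema is hypothetical), a self-contained Python sketch:

```python
# Toy version of the two kinds of checks described above:
# structural (blank/duplicate headers) and schema (field types).
import csv
import io

# Hypothetical Table Schema-style description: field name -> expected type.
SCHEMA = {"id": int, "name": str, "age": int}

def validate_csv(text, schema):
    errors = []
    rows = list(csv.reader(io.StringIO(text)))
    headers, data = rows[0], rows[1:]

    # Structural checks
    if any(h.strip() == "" for h in headers):
        errors.append("blank header")
    if len(set(headers)) != len(headers):
        errors.append("duplicate header")

    # Schema checks: try casting each cell to the declared type
    for i, row in enumerate(data, start=2):   # 1-based, counting the header row
        for header, cell in zip(headers, row):
            expected = schema.get(header)
            if expected is None:
                continue
            try:
                expected(cell)
            except ValueError:
                errors.append(f"row {i}, field '{header}': {cell!r} is not {expected.__name__}")
    return errors

sample = "id,name,age\n1,Alice,34\n2,Bob,not-a-number\n"
print(validate_csv(sample, SCHEMA))
# ["row 3, field 'age': 'not-a-number' is not int"]
```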
Data Tables | data.world
- Datasets for Analysis & Download
- data.world
Keynote | Corporate Data Science
Building successful data science teams
Angela Bassa
Follow @AngeBassa
- Corporate vs Indie Data science
- Corporate – it’s not personal, it’s business; you are working for the man
- “Data is useless without context” @thegrugq (security engineer)
- Context is everything
- Once you see the cow, you can’t unsee it
- Agenda
- Data Science in Industry
- Building effective data science teams
- Proper care and maintenance of data teams
Data Science in Industry
- Some things don’t change:
- data collection, wrangle, deploy, edit, modify, analyze, bugfix, …
- Understand, Collect, Explore/Visualize, Clean/Transform, Model, Validate, Communicate/Deploy
- Bespoke solutions <-> introduction of production software
- Some things change quite a bit
- Blended objectives
- Novel R&D - secret stuff
- Legacy Production - money maker
- Production stuff should never fail
- When working on production scale, you should not break it
- Legacy projects could be pushed into production (with continuous validation) and can turn into novel stuff
Building effective data science teams
- Drew Conway (@drewconway): Data Science 3-way venn diagram with Hacking + Math/Stats + Substantive Expert
- aka Unicorn … but Unicorns do not exist
- “Your intuition will only be accurate if your cumulative experience is a representative sample of reality” @adampiore
- There are different beasts (horses, llamas, zebras, …) that offer different gifts
- Super chickens (from xkcd): the most productive egg-laying chickens. When you put them together, within 6 generations they kill each other
- Dynasties and intellectual inbreeding
- Should not build a team of “super chickens” – every single player should bring something different
- Mirror-tocracies (Kara Swisher) & Survivorship Bias
- Building a large reality
- Hire for what the team needs (What are the gaps? Lack of overlaps?)
- Imposter Syndrome:
Imposter Syndrome: be honest with yourself about what you know and have accomplished & focus less on the difference. pic.twitter.com/VTjS5KdR6Y
— David Whittaker (@rundavidrun) April 13, 2015
- There’s more than one way to skin a cat
- Psychological safety and lower anxiety
- Mistakes might have been made
- Common Blind Spots:
- Software engineers - sampling
- Deep learning practitioners - rigorous statistics (typically from a computer science background; they lack rigorous analysis training such as math & topology)
- PhDs - ship-it-ness (because they care about truth; for magic to take place the product has to leave the bench and be marketed)
- Statisticians - unit tests (by the logic of this equation it’s obvious – why would you need unit tests?)
- Managers - p-hacking (managers keep the team running, but to make money the answer needs to be this)
- Multiple-Objective Decision Analysis (perspectives, seniority, backgrounds, languages, disciplines)
- How to interview for best intersections:
- How self-aware are you? What did you learn? Can you tell me about it?
- Test communication skills
Proper care and maintenance of data teams
- Data Team responsibilities:
- Data Quality, Collection, Engineering, Security, Ethics, Communication, ROI
- “Any sufficiently advanced negligence is indistinguishable from malice.” Deb Chachra (@debcha)
- It’s not that no one cares; it’s that non-data folks are unprepared to care
- Not enough information
- When should you start caring? Do it now
- What you expect is different from the current state of things
- what exists is a system optimized for … not-analysis
- But this is good, because the company has been optimized for something else
- It is really easy to start caring: leave breadcrumbs
- Documentation is how you “program” a business
- Motivation
- Reasons to believe a hypothesis
- Dead ends
- Paused projects
- …
- Figure out “What does the business do? Really?”
- McDonald’s is a real estate company
- Having Junior Team Members is a great forcing function!
- They are not jaded, and are willing to ask questions senior members won’t ask
Q&A
- Different companies have different ways of valuing data assets
- Documentation is to the business as code is to the product
- Documentation is not yet generalizable
- Atlassian Confluence, though a mediocre solution, provides centralized indexing + plugins