MTA Open Data Challenge

new.mta.info

218 points by oftenwrong 2 days ago

chaps a day ago

I do work with "open data" on a near-obsessive basis and -- friends, please do not trust "open data" portals to reflect reality accurately. The datasets are often curated, categories changed during the ETL processes, rows missing, and things like that. For example, Chicago's "crimes" dataset intentionally doesn't include all homicides. Can't remember the exact dataset, but I once had a conversation with Chicago's head of open data who told me that they intentionally removed many rows because they were concerned that the public was going to misinterpret the results... but didn't make it clear that rows were missing. So I guess everybody gets the opportunity to misinterpret the results!

FOIA is the better alternative because it gives you the original, pre-cleaned data. Open data is a lie.

pjot a day ago

This is super true. For my city’s portal as well. I’ve found one way around this by versioning the dataset - that is, committing the diffs in git. Credit to Simon Willison’s git-scraping technique.
I do this with my power company’s outage map: https://github.com/patricktrainer/entergy-outages
67k commits!
https://simonwillison.net/2020/Oct/9/git-scraping/
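
For anyone who wants to try it, here is a minimal sketch of the git-scraping idea, not the code from the repo linked above. It assumes the portal serves JSON and that the target directory is already a git repository; the function names and the `url` and `filename` parameters are hypothetical. The snapshot is normalized (sorted keys, fixed indentation) so that a commit only happens when the data really changed, not when the server merely reordered fields.

```python
import json
import subprocess
import urllib.request
from pathlib import Path


def normalize(raw: bytes) -> str:
    """Pretty-print JSON with sorted keys so diffs reflect real data changes."""
    return json.dumps(json.loads(raw), indent=2, sort_keys=True) + "\n"


def write_if_changed(path: Path, raw: bytes) -> bool:
    """Write the normalized snapshot; return True only if the content differs."""
    new_text = normalize(raw)
    if path.exists() and path.read_text(encoding="utf-8") == new_text:
        return False
    path.write_text(new_text, encoding="utf-8")
    return True


def commit_snapshot(repo_dir: Path, filename: str, message: str) -> None:
    """Stage and commit one file. Assumes repo_dir is already a git repo."""
    subprocess.run(["git", "add", filename], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo_dir, check=True)


def scrape_once(url: str, repo_dir: Path, filename: str) -> bool:
    """Fetch the dataset once and commit it if it changed since the last run."""
    with urllib.request.urlopen(url, timeout=30) as response:
        raw = response.read()
    changed = write_if_changed(repo_dir / filename, raw)
    if changed:
        commit_snapshot(repo_dir, filename, "Update snapshot")
    return changed
```

Run `scrape_once` on a schedule (cron, a CI job, and so on) and `git log -p` on the file becomes a history of every change the publisher made, including rows that quietly disappear.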
- chaps a day ago
  
  That's a really freaking neat trick. Thanks!
amy-petrik-214 7 hours ago

Hah that's classic politics "Hello John Q. Public, here's all our data! It speaks for itself" John Q. Public: "Wow, you really improved last few years homicide-wise" "And so you see, a third party unrelated to us has just confirmed what a great job we're doing with simple empirical, evidence-based governance!"
So that means what you want to do is specialize in identifying bias in these datasets and finding the smoking gun. Such a task can be an ugly business but necessary for the public good, pushing data sharers to either share good data, or not share, but not share tricksy data in this unethical way.
stevage 21 hours ago

I worked in open data for quite a few years. This is a very weird take.
Open data portals generally have data in a useful form. FOI probably gives you PDFs.
- chaps 11 hours ago
  
  "FOI probably gives you PDFs."
  Having submitted thousands of FOIA requests, I get the impression that you haven't, actually, submitted many FOIA requests. I've received many, many, many, many non-PDF FOIA responses.
  Share some of the open data you've worked with and I'd love to poke at it and tell you where it's wrong and where assumptions about its data are wrong.
  - 0cf8612b2e1e 5 hours ago
    
    Thousands?! Do you have a public list on everything?
    Have you had to fight a lot of malicious compliance which balloons up your request count? Or do they typically require such an incredibly narrow request that you have to submit N entries per topic?
  - stevage 4 hours ago
    
    What an unappealing offer. No thanks.
bshep a day ago

Where I grew up the data for murders is curated in such a way that anybody that dies 24 after being attacked is not considered a ‘murder’. They do this to reduce the statistical murder rate.
- chaps a day ago
  
  Can you say more about this?
- whoiscroberts a day ago
  
  Well now we know why crime is down
kalendos a day ago

I can only imagine. Many ETLs are already messy in companies with better tooling and processes.
Would love to read more about your experience with Open Data. Any place where I can reach out?
- chaps a day ago
  
  Here's something about shotspotter data in Chicago: https://x.com/foiachap/status/1775296597850480663
  And this one makes some rounds: https://mchap.io/that-time-the-city-of-seattle-accidentally-...
  Feel free to reach out!
gordon_freeman a day ago

But even if a dataset is incomplete or inaccurate, do you think we could at least get directionally right insights from it?
- chaps a day ago
  
  Yes, of course there can be. But I cannot ignore the harms in doing so, by misrepresenting the data in a way that disallows others to understand what is or isn't there -- it happens regularly. These datasets are often used as a political tool and contracted with local universities to show that they're providing data... though not actually providing the accurate data. Simultaneously though, people who don't know data will champion the data as accurate because it comes from a university program.
  Sometimes what can happen is that somebody inexperienced will try to make some assessment of the data and come to the exact wrong conclusion because they didn't know what not to trust. But it gets on the news anyway and damage is done.
  We can do better than that.
IanCal a day ago

Although pre-cleaned data is often not reflective of reality and requires careful work to use, often requiring a lot more knowledge of the field.

whitej125 a day ago

Would be neat if, instead of an open-ended challenge ("here's some data, do something cool"), the MTA shared a list of hypothetical or real problems to solve and provided data that could be useful in exploring or solving each problem.

maxverse a day ago

Also, considering they just got a 68 billion dollar budget approved [1] over the next 5 years, even a small monetary reward would be nice for this. It doesn't need to be a ton of money, but something other than "here's a piece of memorabilia and we'll write a blog post" would be a good incentive.
[1] https://ny1.com/nyc/all-boroughs/news/2024/09/25/mta-board-a...
- exegete a day ago
  
  I think you are misinterpreting that article. The MTA board approved the plan to spend $68B but they depend on the state to give them funds. That’s the amount of money they are asking for based on the projects they want to complete. The state government has to pass a budget to fund that plan (or do something else). Additionally several current, already started projects are on hold due to the “pause” of congestion pricing which was going to be a funding source.
doctorpangloss a day ago

Why would a cost center political institution enumerate all its problems? It is kind of miraculous they can engage with the public this way at all.

slt2021 a day ago

I could not find a dataset with payroll hours reported and overtime reimbursed for each MTA employee.

I wanted to investigate how well MTA is managing its workforce and compensation (as to require additional tax in form of Congestion Pricing to fix its budget hole), but there seems to be no dataset for that.

Does anyone have links to MTA payroll/hours/overtime related dataset?

Or alternatively, I need a dataset to study each and every subway improvement project, and the components of each project in materials, labor, etc.

WUMBOWUMBO a day ago

perhaps this could be covered in a FOIA request

thecosas a day ago

Time for someone to crack their knuckles and do a Power Broker-style MTA Open Data mashup :-)

https://en.wikipedia.org/wiki/The_Power_Broker

shrikar 12 hours ago

Tried building something with Cursor + ChatGPT in 30 minutes. Not bad for an initial exploration: https://www.youtube.com/watch?v=w3mkXPdTVlI and the demo link: https://mtachallenge.streamlit.app/

stevage 21 hours ago

Interesting, these open data challenges were all the rage 10 years ago. Wonder why the sudden trip down memory lane.

krebby a day ago

Some really nice example visualizations from Matt Yarri and Julia Lynn at the MTA: https://www.linkedin.com/posts/matt-yarri_some-of-the-data-w...

https://new.mta.info/article/introducing-subway-origin-desti...

nocman a day ago

I keep clicking on these 'MTA' articles expecting them to be about a "message transfer agent".

Then I think, oh, right, wrong MTA. Guess I've spent too much time dealing with email servers.

rayrrr a day ago

Hold my Metrocard.

onemoresoop a day ago

Hold my bus transfer card.

asjfkdlf 2 days ago

The prize is very underwhelming. If they really want people to spend effort on it, they need to make the prize worth it.

noitpmeder a day ago

Seems perfect actually! Attracts people that are interested in the subject matter, not just a proposed reward.
- maxverse a day ago
  
  "we're hiring people that really love programming and aren't just in it for the money"
  - 0cf8612b2e1e a day ago
    
    It will look great in your portfolio.
xtiansimon a day ago

> “The winner will receive a vintage New York City Transit item from our memorabilia collection.”
Depends what it is. Long as it’s not something you could steal yourself. Ha!
- jesterman a day ago
  
  One of the options is literally a trash can! https://new.mta.info/document/85441
  Or perhaps... a subway seat? https://new.mta.info/document/85661
  - erikaww a day ago
    
    I’d give multiple weeks of time for a city trash can lol
- mannyv a day ago
  
  Their collection of vintage gum scrapings perhaps?
nxobject a day ago

Never underestimate the value of surplus NYC subway memorabilia to a transit enthusiast. Especially signage from retired rolling stock.
zeroxfe a day ago

If you're doing it for the prize, then you're not the targeted audience :-)
afavour a day ago

IMO it deliberately establishes a tone. This challenge is for rail fans, it’s not a generalised “use our API” hackathon type thing.
Plus the MTA has a huge budget crunch. I really don’t think they could justify spending money on something with such an unclear outcome.
- stevage 21 hours ago
  
  Even still it probably cost tens of thousands of dollars of staff time.
IncreasePosts a day ago

The prize is being able to say you won the prize on your resume. I assume a lot of college kids in data science are going to be going at this.
corytheboyd a day ago

I think it actually sounds kinda cool, if it’s something unique that couldn’t just be purchased!

mcfedr a day ago

Why would you region block a webpage like this

JumpCrisscross a day ago

> Why would you region block a webpage like this
As a part-time New York City taxpayer, I'd rather we not be paying EU lawyers to make sure the MTA's open data complies with European law.
- pc86 a day ago
  
  Good news, the EU doesn't have any jurisdiction in NYC (or anywhere else outside of the EU) so they don't have the ability to enforce anything outside of their borders, as much as they would like you to believe otherwise.
  You can enforce what people and companies do within your borders. You cannot enforce what companies or people outside of your borders do.
  - alwa a day ago
    
    That may come as news to sanctioned Russians and various motley crypto types…
    Isn’t the GDPR’s basic theory about jurisdiction that, if I’m sitting in New York City but routinely serving my web content to people in France, that service I’m providing relies on browsing intentions and tracking functions being executed by a user and on a machine in France, and therefore the meat of the “wrongdoing” is happening within their borders?
    You can choose to do that the European way or not at all. And the local contests division of the NYC local transit authority is choosing “not at all.”
    Isn’t this then a case of NYC complying with the EU’s express wishes for privacy by not “exporting” code they don’t want there?
    
    pc86 14 hours ago
    
    Aren't most sanctions due to e.g. the US making it illegal for banks with a US presence to do business with sanctioned states/people? I don't think the US is telling some Polish bank that only operates in Poland and Russia that they need to stop doing business with Russia, although they may sanction that bank as well if they don't.
    I have no problem with voluntarily complying with GDPR-style privacy regulations because it's the right thing to do. Where I am able to make the decision, we store basically no user data beyond what's required to do whatever the user is trying to do.
    My problem is the EU pretending that US companies must be fully GDPR-compliant because someone in France chooses to go to their website. At the end of the day, laws are only laws because you can enforce them. If I had a magic wand and could rob a bank but the police for some reason were unable to arrest me, the fact that bank robbery is illegal is merely semantic at that point. If I chose to flaunt GDPR non-compliance on a US-based website the EU would be impotent to do anything other than block the site, which wouldn't make me any more likely to suddenly become GDPR compliant.
    It's a fiction and I probably wouldn't care about it nearly as much except it has essentially ruined the public internet with cookie banners everywhere.
    Every time a cookie banner gets displayed on some non-EU resident's personal blog, a puppy dies.
- remram a day ago
  
  In what circumstance do you imagine NYC tax money would go towards EU lawyers?
safeimp a day ago

Reading their terms, I'm guessing it's due to:
> 3. Eligibility: The Challenge is open to legal residents of the United States. Entrants must be 18 years of age or older as of their date of entry. The Challenge is subject to federal, state, and local laws and regulations and is void where prohibited by law. Employees and contractors of the MTA, its subsidiaries, affiliates, and directors (collectively the “Employees”), as well as members of an Employee’s immediate family and/or those living in the same household, are ineligible to participate in the Challenge.
- n_plus_1_acc a day ago
  
  You can be a resident of the US and be on vacation for a couple weeks
- ratedgene a day ago
  
  yeah but wouldn't you want to create enough buzz globally so word of mouth can spread to more US entrants?
  - safeimp a day ago
    
    I don't disagree with you at all, I'm just speculating over why they'd block it.
kassner a day ago

https://web.archive.org/web/20240927144204/https://new.mta.i...
I can access it just fine from Sweden :shrug:
nemo44x a day ago

Because the next thing you know the EU is suing you for billions of Euros.
- deathanatos a day ago
  
  "Doctor it hurts…", IANAL.
  I mean … as I understand the Europeans' law, only if you're doing dumb things to begin with, like giving users' data away to random 3rd parties hellbent on shoving "ads" down one's throat. If you had just made this site a simple HTML page that just had the information the MTA wanted to convey on it, AIUI the EU doesn't have a problem.
  Which … the MTA does appear to be, sending requests to Google, LinkedIn, and some other CDNs.
  I also don't think the MTA has any EU presence, so what are they going to do?
  - JumpCrisscross a day ago
    
    > as I understand the Europeans' law, only if you're doing dumb things to begin with, like giving users' data away to random 3rd parties hellbent on shoving "ads" down one's throat
    There is a massive difference between complying with the law and proving you comply. (Think: IRS audit.)
    > don't think the MTA has any EU presence, so what are they going to do?
    Send letters. The MTA would be obligated to respond to them, which means legal bills.
    
    deathanatos a day ago
    
    > There is a massive difference between complying with the law and proving you comply. (Think: IRS audit.)
    > The MTA would be obligated to respond to them, which means legal bills.
    …why would the MTA be obligated to respond to them? They've no jurisdiction/sovereignty over an American transit agency.
    Why would they audit themselves against laws that don't apply to them? (Again, jurisdiction?) I've never worked for a company that audited itself against every law from every nation on Earth; we complied with the laws where we had a presence and did business.
  - returningfory2 a day ago
    
    > ...AIUI the EU doesn't have a problem
    We're talking about a US transit agency. Even thinking about whether the EU has a problem with the agency's website is sort of absurd to begin with.
    
    warkdarrior a day ago
    
    Did this US transit agency, MTA, obtain permission from all EU citizens who traveled on the MTA to share their data with the whole world?
    
    JumpCrisscross a day ago
    
    > Did this US transit agency, MTA, obtain permission from all EU citizens who traveled on the MTA
    Not how jurisdiction works.
    
    returningfory2 a day ago
    
    Eh this conversation has nothing to do with people traveling on MTA services. We’re talking about people accessing the MTA website. Two different things.
- cddotdotslash a day ago
  
  Expect to see more of this, especially when the audience is local/US. IIRC, some newspapers are already doing region blocks. Why should website owners targeting US visitors spend _any_ amount of money making their content comply with asinine regulations (like cookie banners)?
  - cinntaile a day ago
    
    Cookie banners are not a regulation requirement.
    Contrary to what you seem to believe...There were more geoblocks when the EU law went into action a couple of years ago. There are less now.
    
    cddotdotslash a day ago
    
    > There were more geoblocks when the EU law went into action a couple of years ago. There are less now.
    Source for that?
    
    cinntaile 21 hours ago
    
    My personal experience.
    
    kevin_thibedeau a day ago
    
    EU cookie directive predates GDPR. Notices have long been required by that regulation for use of non-essential cookies.

sgtbr1 a day ago

can someone share the data?

manvillej a day ago

what a tragedy, this person never learned how to read.

leanthonyrn a day ago

Interesting challenge. Here is the NotebookLM Audio: MTA's Open Data program https://notebooklm.google.com/notebook/286a30b9-b17f-4dac-9e...