I do work with "open data" on a near-obsessive basis and -- friends, please do not trust "open data" portals to reflect reality accurately. The datasets are often curated, categories changed during the ETL processes, rows missing, and things like that. For example, Chicago's "crimes" dataset intentionally doesn't include all homicides. Can't remember the exact dataset, but I once had a conversation with Chicago's head of open data who told me that they intentionally removed many rows because they were concerned that the public was going to misinterpret the results... but didn't make it clear that rows were missing. So I guess everybody gets the opportunity to misinterpret the results!
FOIA is the better alternative because it gives you the original, pre-cleaned data. Open data is a lie.
This is super true. For my city’s portal as well. I’ve found one way around this by versioning the dataset - that is, committing the diffs in git. Credit to Simon Willison’s git-scraping technique.
I do this with my power company’s outage map: https://github.com/patricktrainer/entergy-outages
67k commits!
https://simonwillison.net/2020/Oct/9/git-scraping/
That's a really freaking neat trick. Thanks!
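For anyone wondering what the versioning approach buys you, here is a rough sketch of the core idea: keep each snapshot of the export and compare it to the previous one, so silently dropped rows or recategorized records show up. This is not Simon Willison's actual tooling (git-scraping commits the raw file and lets git produce the diff); it is a plain-Python stand-in for that diff. The function names, the `id` and `category` columns, and the sample rows are all made up for illustration.

```python
import csv
import io


def load_snapshot(csv_text, key):
    """Parse one CSV snapshot into a dict of {key value: row dict}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def diff_snapshots(old, new):
    """Return (added, removed, changed) lists of keys between two snapshots."""
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in old if k in new and old[k] != new[k])
    return added, removed, changed


# Hypothetical exports of the same portal dataset on two different days.
monday = "id,category\n1,HOMICIDE\n2,THEFT\n3,ASSAULT\n"
friday = "id,category\n1,DEATH INVESTIGATION\n3,ASSAULT\n4,THEFT\n"

added, removed, changed = diff_snapshots(
    load_snapshot(monday, "id"), load_snapshot(friday, "id")
)
print("added:", added)      # rows that appeared since the last snapshot
print("removed:", removed)  # rows that silently disappeared
print("changed:", changed)  # rows whose fields were edited in place
```

In this toy data, row 2 vanishes and row 1 gets a different category between snapshots, which is exactly the kind of change a portal will not announce but a commit history makes visible.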
Hah that's classic politics "Hello John Q. Public, here's all our data! It speaks for itself" John Q. Public: "Wow, you really improved last few years homicide-wise" "And so you see, a third party unrelated to us has just confirmed what a great job we're doing with simple empirical, evidence-based governance!"
So that means what you want to do is specialize in identifying bias in these datasets and finding the smoking gun. Such a task can be an ugly business but necessary for the public good, pushing data sharers to either share good data, or not share, but not share tricksy data in this unethical way.
I worked in open data for quite a few years. This is a very weird take.
Open data portals generally have data in a useful form. FOI probably gives you PDFs.
"FOI probably gives you PDFs."
Having submitted thousands of FOIA requests, I get the impression that you haven't, actually, submitted many FOIA requests. I've received many, many, many, many non-PDF FOIA responses.
Share some of the open data you've worked with and I'd love to poke at it and tell you where it's wrong and where assumptions about its data are wrong.
Thousands?! Do you have a public list on everything?
Have you had to fight a lot of malicious compliance which balloons your request count? Or do they typically require such narrow requests that you have to submit N entries per topic?
What an unappealing offer. No thanks.
Where I grew up, the murder data is curated in such a way that anybody who dies 24 after being attacked is not counted as a ‘murder’. They do this to reduce the statistical murder rate.
Can you say more about this?
Well now we know why crime is down
I can only imagine. Many ETLs are already messy in companies with better tooling and processes.
Would love to read more about your experience with Open Data. Any place where I can reach out?
Here's something about shotspotter data in Chicago: https://x.com/foiachap/status/1775296597850480663
And this one makes some rounds: https://mchap.io/that-time-the-city-of-seattle-accidentally-...
Feel free to reach out!
But even if a dataset is incomplete or inaccurate, do you think we could at least get directionally right insights from it?
Yes, of course there can be. But I cannot ignore the harms in doing so: misrepresenting the data in a way that prevents others from understanding what is or isn't there happens regularly. These datasets are often used as a political tool, contracted out to local universities to show that the agency is providing data... though not actually providing the accurate data. At the same time, people who don't know data will champion it as accurate because it comes from a university program.
Sometimes what can happen is that somebody inexperienced will try to make some assessment of the data and come to the exact wrong conclusion because they didn't know what not to trust. But it gets on the news anyway and damage is done.
We can do better than that.
Although, pre-cleaned data is often not reflective of reality and requires careful work to use, often demanding a lot more knowledge of the field.
Would be neat if, instead of an open-ended challenge ("here's some data, do something cool"), the MTA shared a list of hypothetical or real problems to solve and provided data that could be useful in exploring or solving them.
Also, considering they just got a 68 billion dollar budget approved [1] over the next 5 years, even a small monetary reward would be nice for this. It doesn't need to be a ton of money, but something other than "here's a piece of memorabilia and we'll write a blog post" would be a good incentive.
[1] https://ny1.com/nyc/all-boroughs/news/2024/09/25/mta-board-a...
I think you are misinterpreting that article. The MTA board approved the plan to spend $68B but they depend on the state to give them funds. That’s the amount of money they are asking for based on the projects they want to complete. The state government has to pass a budget to fund that plan (or do something else). Additionally several current, already started projects are on hold due to the “pause” of congestion pricing which was going to be a funding source.
Why would a cost center political institution enumerate all its problems? It is kind of miraculous they can engage with the public this way at all.
I could not find a dataset with payroll hours reported and overtime reimbursed for each MTA employee.
I wanted to investigate how well the MTA is managing its workforce and compensation (given that it requires an additional tax in the form of congestion pricing to fix its budget hole), but there seems to be no dataset for that.
Does anyone have links to an MTA payroll/hours/overtime dataset?
Or alternatively, I need a dataset to study each and every subway improvement project, with each project broken down into materials, labor, etc.
perhaps this could be covered in a FOIA request
Time for someone to crack their knuckles and do a Power Broker-style MTA Open Data mashup :-)
https://en.wikipedia.org/wiki/The_Power_Broker
Tried building something with Cursor + ChatGPT in 30 minutes. Not bad for an initial exploration: https://www.youtube.com/watch?v=w3mkXPdTVlI and the demo link: https://mtachallenge.streamlit.app/
Interesting, these open data challenges were all the rage 10 years ago. Wonder why the sudden trip down memory lane.
Some really nice example visualizations from Matt Yarri and Julia Lynn at the MTA: https://www.linkedin.com/posts/matt-yarri_some-of-the-data-w...
https://new.mta.info/article/introducing-subway-origin-desti...
I keep clicking on these 'MTA' articles expecting them to be about a "message transfer agent".
Then I think, oh, right, wrong MTA. Guess I've spent too much time dealing with email servers.
Hold my Metrocard.
Hold my bus transfer card.
The prize is very underwhelming. If they really want people to spend effort on it, they need to make the prize worth it.
Seems perfect actually! Attracts people that are interested in the subject matter, not just a proposed reward.
"we're hiring people that really love programming and aren't just in it for the money"
It will look great in your portfolio.
> “The winner will receive a vintage New York City Transit item from our memorabilia collection.”
Depends what it is. Long as it’s not something you could steal yourself. Ha!
One of the options is literally a trash can! https://new.mta.info/document/85441
Or perhaps... a subway seat? https://new.mta.info/document/85661
I’d give multiple weeks of time for a city trash can lol
Their collection of vintage gum scrapings perhaps?
Never underestimate the value of surplus NYC subway memorabilia to a transit enthusiast. Especially signage from retired rolling stock.
If you're doing it for the prize, then you're not the targeted audience :-)
IMO it deliberately establishes a tone. This challenge is for rail fans, it’s not a generalised “use our API” hackathon type thing.
Plus the MTA has a huge budget crunch. I really don’t think they could justify spending money on something with such an unclear outcome.
Even still it probably cost tens of thousands of dollars of staff time.
The prize is being able to say you won the prize on your resume. I assume a lot of college kids in data science are going to be going at this.
I think it actually sounds kinda cool, if it’s something unique that couldn’t just be purchased!
Why would you region block a webpage like this
> Why would you region block a webpage like this
As a part-time New York City taxpayer, I'd rather we not be paying EU lawyers to make sure the MTA's open data complies with European law.
Good news, the EU doesn't have any jurisdiction in NYC (or anywhere else outside of the EU) so they don't have the ability to enforce anything outside of their borders, as much as they would like you to believe otherwise.
You can enforce what people and companies do within your borders. You cannot enforce what companies or people outside of your borders do.
That may come as news to sanctioned Russians and various motley crypto types…
Isn’t the GDPR’s basic theory about jurisdiction that, if I’m sitting in New York City but routinely serving my web content to people in France, that service I’m providing relies on browsing intentions and tracking functions being executed by a user and on a machine in France, and therefore the meat of the “wrongdoing” is happening within their borders?
You can choose to do that the European way or not at all. And the local contests division of the NYC local transit authority is choosing “not at all.”
Isn’t this then a case of NYC complying with the EU’s express wishes for privacy by not “exporting” code they don’t want there?
Aren't most sanctions due to e.g. the US making it illegal for banks with a US presence to do business with sanctioned states/people? I don't think the US is telling some Polish bank that only operates in Poland and Russia that they need to stop doing business in Russia, although they may sanction that bank as well if they don't.
I have no problem with voluntarily complying with GDPR-style privacy regulations because it's the right thing to do. Where I am able to make the decision, we store basically no user data beyond what's required to do whatever the user is trying to do.
My problem is the EU pretending that US companies must be fully GDPR-compliant because someone in France chooses to go to their website. At the end of the day, laws are only laws because you can enforce them. If I had a magic wand and could rob a bank but the police for some reason were unable to arrest me, the fact that bank robbery is illegal is merely semantic at that point. If I chose to flaunt GDPR non-compliance on a US-based website the EU would be impotent to do anything other than block the site, which wouldn't make me any more likely to suddenly become GDPR compliant.
It's a fiction and I probably wouldn't care about it nearly as much except it has essentially ruined the public internet with cookie banners everywhere.
Every time a cookie banner gets displayed on some non-EU resident's personal blog, a puppy dies.
In what circumstance do you imagine NYC tax money would go towards EU lawyers?
Reading their terms, I'm guessing it's due to:
> 3. Eligibility: The Challenge is open to legal residents of the United States. Entrants must be 18 years of age or older as of their date of entry. The Challenge is subject to federal, state, and local laws and regulations and is void where prohibited by law. Employees and contractors of the MTA, its subsidiaries, affiliates, and directors (collectively the “Employees”), as well as members of an Employee’s immediate family and/or those living in the same household, are ineligible to participate in the Challenge.
You can be a resident of the US and be on vacation for a couple weeks
yeah but wouldn't you want to create enough buzz globally so word of mouth can spread to more US entrants?
I don't disagree with you at all, I'm just speculating over why they'd block it.
https://web.archive.org/web/20240927144204/https://new.mta.i...
I can access it just fine from Sweden :shrug:
Because the next thing you know the EU is suing you for billions of Euros.
"Doctor it hurts…", IANAL.
I mean … as I understand the Europeans' law, only if you're doing dumb things to begin with, like giving users' data away to random 3rd parties hellbent on shoving "ads" down one's throat. If you had just made this site a simple HTML page that just had the information the MTA wanted to convey on it, AIUI the EU doesn't have a problem.
Which … the MTA does appear to be, sending requests to Google, LinkedIn, and some other CDNs.
I also don't think the MTA has any EU presence, so what are they going to do?
> as I understand the Europeans' law, only if you're doing dumb things to begin with, like giving users' data away to random 3rd parties hellbent on shoving "ads" down one's throat
There is a massive difference between complying with the law and proving you comply. (Think: IRS audit.)
> don't think the MTA has any EU presence, so what are they going to do?
Send letters. The MTA would be obligated to respond to them, which means legal bills.
> There is a massive difference between complying with the law and proving you comply. (Think: IRS audit.)
> The MTA would be obligated to respond to them, which means legal bills.
…why would the MTA be obligated to respond to them? They've no jurisdiction/sovereignty over an American transit agency.
Why would they audit themselves against laws that don't apply to them? (Again, jurisdiction?) I've never worked for a company that audited itself against every law from every nation on Earth; we complied with the laws where we had a presence and did business.
> ...AIUI the EU doesn't have a problem
We're talking about a US transit agency. Even thinking about whether the EU has a problem with the agency's website is sort of absurd to begin with.
Did this US transit agency, MTA, obtain permission from all EU citizens who traveled on the MTA to share their data with the whole world?
> Did this US transit agency, MTA, obtain permission from all EU citizens who traveled on the MTA
Not how jurisdiction works.
Eh this conversation has nothing to do with people traveling on MTA services. We’re talking about people accessing the MTA website. Two different things.
Expect to see more of this, especially when the audience is local/US. IIRC, some newspapers are already doing region blocks. Why should website owners targeting US visitors spend _any_ amount of money making their content comply with asinine regulations (like cookie banners)?
Cookie banners are not a regulation requirement.
Contrary to what you seem to believe...There were more geoblocks when the EU law went into action a couple of years ago. There are less now.
> There were more geoblocks when the EU law went into action a couple of years ago. There are less now.
Source for that?
My personal experience.
EU cookie directive predates GDPR. Notices have long been required by that regulation for use of non-essential cookies.
can someone share the data?
what a tragedy, this person never learned how to read.
Interesting challenge. Here is the NotebookLM Audio: MTA's Open Data program https://notebooklm.google.com/notebook/286a30b9-b17f-4dac-9e...