I recently got a Samsung device for testing, and the experience was terrible. It took three hours to get the device into a usable state.
First, it essentially forces you to create both a Samsung account and a Google account, with numerous shady prompts for "improving services" and "allowing targeted ads."
Then it required nine system updates (apparently, it can only update incrementally), and worst of all, after a while, it automatically started downloading bloatware like "Kawai" and other questionable apps, and you cannot cancel the downloads.
I wonder how much Samsung gets paid to preinstall all that crap. The phone wasn't cheap, either. The company seems penny wise and pound foolish.
While I think there's significant AI "offloading" in writing, the article's methodology relies on "AI-detectors," which reads like PR for Pangram. I don't need to explain why AI detectors are mostly bullshit and harmful for people who have never used LLMs. [1]
I am not sure if you are familiar with Pangram (co-founder here), but we are a group of research scientists who have made significant progress in this problem space. If your mental model of AI detectors is still GPTZero, or the tools that say the Declaration of Independence is AI, then you probably haven't seen how much better they've gotten.
Nothing points out that the benchmark is invalid quite like a zero false positive rate. Seemingly it is pre-2020 text versus a few models' reworkings of texts. I can see this model falling apart in many real-world scenarios. Yes, LLMs use strange language if left to their own devices, and this can surely be detected. But a 0% false positive rate under all circumstances? Implausible.
Max, there are two problems I see with your comment.
1) The paper didn't show a 0% FNR. Tables 4, 7, and B.2 are pretty explicit about that, and it's not hard to work out from the others either.
2) A 0% error rate requires some pretty serious assumptions to hold. For that type of result not to be incredibly suspect, there has to be zero noise in the data, in the analysis, and in every other part of the pipeline. I do not see that being true of the dataset in question.
Even merely high scores are suspect. Generalizing the previous point: a score is suspect whenever it is higher than the noise level allows. Can you truly attest that this condition holds?
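To make that concrete, here's a back-of-the-envelope sketch. The n below is my assumption, taken from the human-text row of the confusion matrix quoted downthread, not a number pulled straight from the paper:

    n = 1881            # human texts in the confusion matrix downthread (my assumption)
    upper_95 = 3 / n    # "rule of three": rough 95% upper bound on the true rate when 0/n errors are observed
    print(f"0/{n} observed false positives is still consistent with a true FPR of up to ~{upper_95:.2%}")

A zero count only bounds the rate; it doesn't establish 0%.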
I suspect that you're introducing data leakage. I haven't looked deeply enough into your training and data to determine how that's happening, but you'll probably need a pretty deep analysis, because leakage is really easy to sneak in, and it can do so in non-obvious ways. A very common one is tuning hyperparameters on test results: you don't have to pass data to pass information. Another sly way for this to happen is a test set that isn't sufficiently disjoint from the training set. If the perturbation is too small, then you aren't testing generalization; you're testing a slightly noisy copy of the training set (which your training should already be adding noise to for regularization, so you end up just measuring your training performance).
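As a minimal sketch of that last pattern, here's a toy example with scikit-learn on pure noise (my own made-up data, obviously not your pipeline), where a test set that isn't sufficiently disjoint from the training set makes a meaningless model look strong:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Pure-noise labels: the honest accuracy of any classifier here is ~50%.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 200))
    y_train = rng.integers(0, 2, size=100)
    clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)   # happily memorizes the noise

    # A "test set" that is only a tiny perturbation of the training set: this mostly
    # measures memorization, not generalization.
    X_test_leaky = X_train + 0.01 * rng.normal(size=X_train.shape)
    print(accuracy_score(y_train, clf.predict(X_test_leaky)))       # far above chance

    # A genuinely disjoint test set tells the truth.
    X_fresh = rng.normal(size=(100, 200))
    y_fresh = rng.integers(0, 2, size=100)
    print(accuracy_score(y_fresh, clf.predict(X_fresh)))            # back to ~0.5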
Your numbers are too good and that's suspect. You need a lot more evidence to suggest they mean what you want them to mean.
                        EditLens (Ours)
                        Predicted Label
                   Human      Mix       AI
                ┌─────────┬─────────┬─────────┐
          Human │    1770 │     111 │       0 │
                ├─────────┼─────────┼─────────┤
  True     Mix  │     265 │    1945 │      28 │
  Label         ├─────────┼─────────┼─────────┤
          AI    │       0 │     186 │    1695 │
                └─────────┴─────────┴─────────┘
It looks like about 5% of human texts from your paper are marked as mixed, and 5-10% of AI texts are marked as mixed, again from your paper.
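For anyone who wants to check my reading, here's the row-normalized version of the matrix above (my own arithmetic with numpy, not a table from the paper):

    import numpy as np

    cm = np.array([[1770,  111,    0],
                   [ 265, 1945,   28],
                   [   0,  186, 1695]])
    print((cm / cm.sum(axis=1, keepdims=True)).round(3))
    # Human texts: ~5.9% flagged as Mix, 0% as AI
    # Mix texts:  ~11.8% flagged as Human, ~1.3% as AI
    # AI texts:    ~9.9% flagged as Mix, 0% as Human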
I guess I don’t see that this is much better than what’s come before, using your own paper.
Edit: this is an irresponsible Nature news article, too - we should see a graph of this detector's output over the past ten years to see how much of this 'deluge' is algorithmic error.
It is not wise to brag about your product when the GP is pointing out that the article "reads like PR for Pangram," regardless of whether AI detectors are reliable or not.
I would say it's important to hold off on the moralizing until after showing some visible effort to reflect on the substance of the exchange, which in this case is whether it's fair to assert that the detection methodology employed here shares the flaws of the familiar online AI checkers. That's a substantive and rebuttable point, and all the meaningful action in the conversation is embedded in those details.
In this case, several important distinctions are drawn: being open about the criteria, about properties being tested for such as "perplexity" and "burstiness," and an explanation of why detectors incorrectly flag the Declaration of Independence as AI-generated (it's ubiquitous). Those distinctions testify to the credibility of the model, which has to matter to you if you're going to start moralizing.
There are dozens of first-generation AI detectors, and they all suck. I'm not going to defend them. Most of them use perplexity-based methods, which are a decent separator of AI and human text (80-90%) but have flaws that can't be overcome, including high FPRs on ESL text.
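For anyone curious what "perplexity-based" means in practice, here's a minimal sketch of that first-generation heuristic, assuming the torch and transformers packages (this is emphatically not how Pangram works):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def perplexity(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss    # mean next-token cross-entropy
        return torch.exp(loss).item()

    # The heuristic: low perplexity = "predictable to a model" = probably AI.
    # It's also why these tools misfire on ESL prose and on ubiquitous passages
    # (any LM finds the Declaration of Independence highly predictable).
    print(perplexity("We hold these truths to be self-evident, that all men are created equal."))
    print(perplexity("My cousin's llama apparently refuses to eat waffles except on Tuesdays."))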
Pangram is fundamentally different technology: a large deep-learning-based model trained on hundreds of millions of human and AI examples. Some people see a dozen failed attempts at a problem as proof that the problem is impossible, but I would like to remind you that basically every major and minor technology was preceded by failed attempts.
Some people see a dozen extremely profitable, extremely destructive attempts at a problem as proof that the problem is not a place for charitable interpretation.
GAN... Just feed the output of your algorithms back into the LLM during training. At the end of the day the problem is impossible, but we're not there yet.
Pangram is trained on this task as well, to add additional signal during training, but it's only ~90% accurate, so we don't show the prediction in public-facing results.
> Are you concerned with your product being used to improve AI to be less detectable?
The big AI providers don't have any obvious incentive to do this. If it happens 'naturally' in the pursuit of quality then sure, but explicitly training for stealth is a brand concern in the same way that offering a fully uncensored model would be.
Smaller providers might do this (again, in the same way they now offer uncensored models), but they occupy a minuscule fraction of the market and will be a generation or two behind the leaders.
They don't have an incentive to make their AIs better? If your product can genuinely detect AI writing, of course they would use it to make their models sound more human. The biggest criticism of AI right now is how robotic and samey it sounds.
It's definitely going to be a back and forth - model providers like OpenAI want their LLMs to sound human-like. But this is the battle we signed up for, and we think we're more nimble and can iterate faster to stay one step ahead of the model providers.
Hi Max! Thank you for updating my mental model of AI detectors.
I was, with total certainty, under the impression that detecting AI-written text was an impossible-to-solve problem. I think that's because it's just so deceptively intuitive to believe that "for every detector, there'll just be a better LLM, and it'll never stop."
I recently published a macOS app called Pudding to help humans prove they wrote a text, mainly under the assumption that this problem can't be solved with measurable certainty by traditional methods.
Now I'm of course a bit sad that the problem can apparently be solved much more directly (making my solution less necessary). But hey, I fell in love with the problem, so I'm super impressed with what y'all are accomplishing with Pangram!
AI detectors are only harmful if you use them to convict people; it isn't harmful to gather statistics like this. They didn't find many AI-written papers, just AI-written peer reviews, which is what you would expect: few people would generate their whole paper submission, whereas peer review is thankless work.
If you have a bullshit measure that says some phenomenon (e.g. crime) is concentrated in some area, you will become biased to expect it in that area. It wrongly creates a spotlight effect, in which other questionable measures are then used to do the actual convicting ("Look! We found an em dash!").
I think there is definitely a funny bit of mental gymnastics that goes on here sometimes. LLM skeptics (which I'm not saying the Pangram folks are, in particular) will say: "LLMs are unreliable and therefore useless; they produce slop at great cost to the environment and other people." But if a study comes out that confirms their biases and uses an LLM in the process, or if they themselves use an LLM to identify (or, in many cases, just validate their preconceived notion) that something was drafted with an LLM, then all of a sudden things are above board.
Me neither, but when searching for hotels and Airbnbs, I only filter for places rated 8+/10 domestically and 9+/10 internationally, which screens out many of the hotels that have those kinds of issues (and filtering by score doesn't affect the budget much).
Booking.com has this grade-inflation issue: if one thing is shit but you rate everything else fairly (location, staff friendliness, etc.), the final score will still be a 7 or an 8. In summary: I had a lousy experience, 7/10!
It takes some experience to realize that a place graded 7.x probably has serious issues.
The problem here is that "mean" is a poor average. For hotels, if you're rating in 10 different categories, you really want a single 0/10 to bring the overall score down by way more than one point.
The opposite situation can also occur. At my university, entrance scholarships were decided a few years ago based on students' aggregate score across 25ish dimensions (I can't remember the exact number) where students were each rated 1-4. Consequently a student who was absolutely exceptional in one area would be beaten out by a student who was marginally above average in all the other areas. I suggested that rather than scoring 1-4 the scores should be 1/2/5/25 instead.
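To make both points concrete, here's a toy sketch with numbers I made up (not from any real rating system, and not my university's actual rubric):

    # The hotel point: nine categories around 9-10 and one total disaster.
    scores = [9, 10, 9, 9, 10, 9, 10, 9, 9, 0]
    mean = sum(scores) / len(scores)                 # 8.4 -- "solid hotel", says the mean
    prod = 1.0
    for s in scores:
        prod *= s + 1                                # shift by 1 so a zero doesn't erase the product outright
    geometric_like = prod ** (1 / len(scores)) - 1   # ~7.2 -- a single 0 now costs real points
    print(round(mean, 1), round(geometric_like, 1))

    # The scholarship point: a convex 1/2/5/25 mapping lets one exceptional
    # dimension outweigh being merely fine everywhere else.
    convex = {1: 1, 2: 2, 3: 5, 4: 25}
    exceptional = [4, 2, 2, 2]                       # brilliant in one area
    flat = [3, 3, 3, 3]                              # marginally above average in all
    print(sum(exceptional), sum(flat))               # linear 1-4: 10 vs 12 -> flat wins
    print(sum(convex[x] for x in exceptional),
          sum(convex[x] for x in flat))              # convex: 31 vs 20 -> exceptional wins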
The problem here begins even before the mathematical issue: websites that live off listing bookings have an incentive to offer a way to delete reviews that are not in line with what the owner wants to see.
Honestly, the ratings on those sites are essentially useless anyway, because people are bad at reviewing.
I generally sample the lowest-rated written reviews to check whether people are complaining about real stuff or are just confused. For instance, if a hotel doesn't have a bar, some of the negative reviews will usually be about how the hotel doesn't have a bar; these can be safely ignored as having been written by idiots (it is not like the hotel is hiding the fact that it doesn't have a bar).
Occasionally some of the positive reviews are similarly baffling. I was recently booking a hotel in Berlin in January, and the top review's main positive comment about the hotel was that it had heating. Well, yeah, I mean, you'd hope so. I can only assume that the reviewer was a visitor from the 19th century.
The worst thing I’ve found with positive reviews is ones that are obviously fake/incentivized. I looked up reviews recently for a hotel that I used to stay at a lot for work, and had gone way downhill with many issues (broken ACs, mold, leaking ceilings, etc.). I was curious if they ever fixed their problems. I was at first surprised that they had a fairly positive overall review rating. But looking deeper, the many negative reviews were just crowded out by obviously fake reviews. Dead giveaways: every single one named multiple people by name. “Dave at the front desk was just so friendly and welcoming! Barbara the housecleaner did a fantastic job cleaning. And Steve the bartender just made my day! I love this hotel! 5 stars!” (Almost) nobody reviewing a hotel for real does that.
In the early days of Android / iOS, just installing an app and registering was enough for the company to get your device's MAC address, and thus your indoor location with considerable precision.
They could access your Wi-Fi network's BSSID (whose location is often public due to wardriving databases), and in public places, they had partner companies (malls, airports, etc.) whose routers would triangulate your position based on Wi-Fi signal strength and share information like "John is in the food court near McDonald's."
All of this happened without you even needing to connect to their Wi-Fi, because your phone used to broadcast its MAC address whenever Wi-Fi was simply turned on. MAC addresses are now randomized, but it took Google / Apple a long time to get there.
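If the signal-strength triangulation part sounds abstract, here's a minimal sketch of RSSI-based trilateration with made-up router positions, readings, and path-loss constants (illustrative assumptions only, not what any particular mall actually runs):

    import numpy as np

    # Known router positions (metres) and the signal strengths they report for one MAC.
    aps = np.array([[0.0, 0.0], [30.0, 0.0], [0.0, 30.0]])
    rssi = np.array([-55.0, -67.0, -71.0])           # dBm, made up for illustration

    # Log-distance path-loss model: rssi = p0 - 10*n*log10(d), solved for d.
    p0, n = -40.0, 2.5                               # assumed 1 m reference power and path-loss exponent
    d = 10 ** ((p0 - rssi) / (10 * n))               # rough distance to each router

    # Linearised least-squares trilateration (subtract the first circle's equation).
    A = 2 * (aps[1:] - aps[0])
    b = d[0] ** 2 - d[1:] ** 2 + np.sum(aps[1:] ** 2, axis=1) - np.sum(aps[0] ** 2)
    position, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(position)                                  # rough (x, y) of the phone broadcasting that MAC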
What do you mean? The MAC address is used to identify the device within the same network segment. A program running on the device cannot derive location information just from the MAC address. It's a meaningless number. What the MAC address can do is make you visible to other devices in the same network segment. So for example, a wireless router can know you're nearby because your known MAC address has joined the network, but this is a problem regardless of what apps your phone is running.
That's what the GP was saying, I think. Once they get the MAC address, they can find you - not via software on the phone, but by exfiltrating it and using shady third parties that collect data from access points, etc.
Okay, but if there's collusion between the app developers and external routers then it doesn't matter if the MAC is randomized. The app can still see the current MAC address and report it, and you can still be located, if nothing else, to within the range of a wireless router. Nothing is solved by randomizing the MAC address.
I've worked as an EM at four different companies, from large enterprises to small startups, and I think "the role of engineering manager" is a myth. Your role varies wildly from one company to another. In every company I've worked at, my job has never been the same.
In the end, engineering management basically requires you to counter-balance whichever of the four pillars (Product, Process, People, and Programming) your team needs most:
- Too few people? You'll work on scope to make the deliverables meet reality. Since there's not much communication overhead, you'll be able to program.
- No PM? You now own the product pillar entirely. This takes a lot of your time: You'll need to validate features, prioritize the roadmap, and even talk directly with clients. None of the rest matters if your team is shipping features with no user value.
- Too many people in the team/company? Say goodbye to programming. You'll be responsible for careers, making everyone work cohesively, and navigating the org to get the right resources and support for your team.
- Reporting close to the CEO? You'll handle the bridge between sales, operations, client communications, and other functions.
The common thread is that your focus constantly shifts based on where your team's bottlenecks are. The key is identifying which pillar needs attention and adapting accordingly.
I feel like a lot of leadership positions are like this. I was a Principal Tech Lead at a 300-person company, and I did everything from PMing large tech teams, to collecting info from top users in spreadsheets, to building demos directly for the CEO, to building a key part of our tech used by over 100 other engineers.
I always told people I’d plunge the toilets myself if they were preventing the staff from working. I feel like the closer you get to top leadership the more your job becomes identifying and executing on whatever is highest value that you have the skills for.
> identifying and executing on whatever is highest value that you have the skills for
There's a hidden assumption there, though: that you CAN actually do that. At least management skills mostly stick over time, but even a year away from hands-on technical work is likely to leave you stranded and unable to execute on the technical aspects. Which is why I continue to push back against suggestions that technical managers shouldn't stay hands-on. Apart from being incredibly hostile to their own interests (hands-on ability will be central to getting hired into any future role), it also impairs one of the most strategic aspects of the role, which can drastically affect the value you can deliver internally in the future as well.
> but even a year away from hands on technical work is going to leave you likely stranded and unable to execute on the technical aspects
This is an interesting myth, but certainly a myth. I guess if we consider technical skill to be intimate knowledge of the latest fad framework, that might be one source of the myth. But that's not technical skill; it's just trivia about an implementation detail.
The fundamentals, like networking, process and memory management, and databases and SQL, all change slowly and are long-lived, career-spanning knowledge.
Agreed, I haven’t seen this in my career at least. I’ve worked with contractors on a yearly basis who would take some time off and then hit the ground running.
If there’s any data supporting the opposite, I’d love to see it.
Kubernetes is not a fad. DynamoDB and MongoDB are not fads. Golang is not a fad. These were all born in the last couple of decades, so they are rather new, and they will stay around for decades to come. And the list goes on and on. So all of the skills in your list mean nothing when it comes to these fundamental technical tools; they require an understanding at a completely different abstraction level that is every bit as complex as the things you listed.
So if you don't have an understanding of these technologies when the project requires it, you are obsolete and have no right to be in a leading position. And such fundamental technologies are born continuously.
So this idea that having the fundamentals is enough is definitely a myth.
> Kubernetes is not a fad. DynamoDB and MongoDB is not a fad. Golang is not a fad.
These are indeed good examples of things that are merely tools, not fundamental knowledge.
Time-transport me an expert C programmer from the 80s and I'll have them productive in Go in two weeks. It's all very familiar territory.
Or send me a mainframe programmer from the 60s and they'll be up to speed on Kubernetes in short order. Pushing your workload off to a remote cloud (read: mainframe) won't exactly be new to them.
Databases have been studied and their properties understood for a very long time.
Sure, the exact details vary a bit and the command line options are different, but that's not significant.
Yeah, that sounds logical. Some of the most popular technologies of our time are just teeny-tiny tools, but long-obsolete technologies and their attached skills, which have no correlation to anything recent, are somehow fundamental and have magical properties in your view :D Thanks for the good laugh!
> Databases have been studied and their properties understood for a very long time.
No, they haven't. No one ever considered schemaless databases, column-store databases, or vector databases for half a century after the birth of computing. So that kind of knowledge (relational DBs, etc., in the 60s and 80s) meant nothing in light of these new technologies, which required completely different skills and knowledge.
But it's clear you are not familiar with these technologies, so it's a waste of time to engage with you further.
If people think that not being hands-on for a year is unmanageable, then we as an industry are doing something horrifically wrong.
It would mean that no engineer could ever aspire to become a parent, take a sabbatical, further their education, or experiment with alternate career paths.
But I promise you that that is not actually the case. In fact, it is often the engineers who've stifled every other part of their life that are most likely to struggle in their mid-careers and beyond.
Yes, I don't mean actually taking time away - I mean organisationally, once you assume a role that is divorced from the technical aspects and then try to come back to managing them without hands-on experience. You will find that other, more technically informed people rise up and start to become the decision makers - you can't be authoritative any more and constantly have to ask someone else for input on technical aspects, since you aren't up to date with the current set of assumptions.
> I always told people I’d plunge the toilets myself if they were preventing the staff from working.
This is a lot closer to a literal interpretation of "shit rolls downhill, so a good manager will be a shit umbrella to protect their team" than I thought I'd ever see.
You have to be careful of the perceived politics around this. Tall poppies get cut down. I still don’t totally understand why but sometimes taking initiative doesn’t sit well with the folks who want their trains to run on time.
I think this varies from person-to-person or maybe organization-to-organization. I've definitely seen variations in the health of organizations but I think you can break up the categories used to judge them, e.g., meritocracy, overtime frequency, planning accuracy, psychological safety, value of work, etc.
I'd say the place I worked felt above average in meritocracy. In other words, it felt like folks sticking out to take initiative were more often rewarded than punished. I don't think we were perfect in every category though.
At least in small companies, my experience is that being adaptive like this applies to ICs as well as managers. Although to be fair the environment I'm thinking of doesn't have any full time managers.
I used them. Compression is an issue in other protocols (sending via WhatsApp, for example). Another benefit is that photos sent by Airdrop get automatically backed up. It also works well in areas with poor internet connectivity. For example, some beaches have weak cellphone signals due to their surroundings, so when meeting friends, we generally use Airdrop.
It really shows how LLMs work. It's all about probabilities, not understanding. If something looks very similar to a well-known problem, the LLM has a hard time "seeing" the contradictions, even when they're really easy for humans to notice.
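As a toy illustration of the "it's all probabilities" point, this sketch (assuming the torch and transformers packages, and using GPT-2 purely because it's small) just prints the most likely next tokens for a prompt that resembles a famous riddle:

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    # The model only ranks continuations by probability; it has no mechanism for
    # noticing that the prompt already answered the "riddle" it resembles.
    prompt = "The surgeon, who is the boy's father, says: I can't operate on this boy, he is my"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]            # scores for the next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, 5)
    print([(tok.decode(int(i)), round(p.item(), 3)) for p, i in zip(top.values, top.indices)])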
Apple seems to be running out of steam. Xiaomi, "the Apple clone," is now releasing cars and XR devices. Meanwhile, it has been a while since Apple released a new product line. The last one was the Apple Vision Pro. With Apple Intelligence, they have shown that they can't "think different" anymore.
Sure, Apple will remain a trillion-dollar company for a long time, partly because its competition keeps shooting itself in the foot. Windows and Android are hostile towards power users and bloat the system with pre-installed apps, and they are both stepping on the gas.
But the real question is: how long can brand loyalty alone sustain the hype of new Apple products? And when will Apple stop being considered a "growth" company?
HN users will complain when a company tries to be all-consuming and unfocused, and then others deride it as uninspired when it focuses only on its core business (which in this case still makes Apple endless amounts of money).
This is just some high fashion accessory they release, like a clothing company selling wallets on the side. It's not a big deal.
> Apple will remain a trillion-dollar company for a long time, partly because its competition keeps shooting itself in the foot
You must mean the competition was driven away. Apple and Google play a dance that protects their profits, but Chinese companies didn't play by the same rules and were becoming too disruptive.
Yep, Apple has lost its way. Looking at the release of the latest OnePlus 15, the only reason to keep going with an iPhone is basically ecosystem lock-in.
In my opinion they are losing on every front but the chips, and those have become of secondary importance in smartphones and are about to face heated competition in laptops.
In the short term, Apple can work on their pricing; considering their large margins, they have a lot of runway. But they need to find something new to keep being the top dog.
I think OpenAI is screwed long-term, and their leadership knows it. Their most significant advantage was their employees, most of whom have now left for other companies. They're getting boxed in across every segment where they were previously the leader:
- Multimodality (browser use, video): To compete here, they need to take on Google, which owns the two biggest platforms and can easily integrate AI into them (Chrome and YouTube).
- Pricing: Chinese companies are catching up fast. It feels like a new Chinese AI company appears every day, slowly creeping up the SOTA benchmarks (and now they have multimodality, too).
- Coding and productivity tools: Anthropic is now king, with both the most popular coding tool and model for coding.
- Social: Meta is a behemoth here, but it's surprising how far they've fallen (where is Llama at?). This is OpenAI's most likely path to success with Sora, but history tells us AI content trends tend to fade quickly (remember the "AI Presidents" wave?).
OpenAI knows that if AGI arrives, it won't be through them. Otherwise, why would they be pushing for an IPO so soon?
It makes sense to cash out while we're still in "the bubble." Big Tech profits are at an all-time high, and there's speculation about a crash late next year.
I'd agree with all those facts about the competitive landscape, but in each of those competitors, there's enough wiggle room for me to think OpenAI isn't completely boxed in.
Google on multimodality: it has been truly impressive over the last six months and has the deep advantages of Chrome, YouTube, and being the default web indexer, but it's entirely plausible they flub the landing on deep product integration.
Chinese companies and pricing: facts, and it's telling to me that OpenAI seems to have abandoned their rhetorical campaign from earlier this year teasing that "maybe we could charge $20000 a month" https://techcrunch.com/2025/03/05/openai-reportedly-plans-to....
Coding: Anthropic has been impressive but reliability and possible throttling of Claude has users (myself included) looking for alternatives.
Social: I think OpenAI has the biggest opportunity here, as it is the most consumer-oriented of the model hyperscalers and has a gigantic user base that it can take to whatever AI-based platform category replaces social. I'm somewhat skeptical that Meta at this point has its finger on the pulse of social users, and I think Superintelligence Labs isn't well designed to capitalize on Meta's advantages in segueing from social to whatever replaces it.
> an ipo is a way to seek more capital. they don't think they can achieve agi solely through private investment.
Private deals have recently been getting bigger than public deals, so perhaps the IPO market is not a larger source of capital. Different, untapped capital, maybe, but probably not larger.
Unfortunately, I think you are wrong. Their most important asset is the company's leadership position, the brand name, and the muscle memory. Individual employees may come and go - at a system level this doesn't matter much, as long as they can replace talented folks with other talented ones. And that seems to be the case for now.
I have to agree. If services like DeepSeek remain free, or at least extremely cheap, I don't see a long-term profitability outlook for OpenAI. Gemini has also greatly improved, and with Google's infrastructure and ecosystem... again, the long-term outlook doesn't look promising for OpenAI.
https://jampauchoa.substack.com/p/writing-with-ai-without-th...
TL;DR: Ask for a line edit: "Line edit this Slack message / HN comment." It goes beyond fixing grammar (it also improves flow) without killing your meaning or adding AI-isms.