> We asked for a bill with the standard CPT codes. No reply. Asked again. “Oh, we meant to send it. We upgraded our computers five months ago and nothing works.” Uh-huh. Finally got the CPT codes.
I work in healthcare RCM.
I have no trouble believing the staff here that nothing in their system works.
We had a similar realization here at Thoughtful and pivoted towards code generation approaches as well.
I know the authors of Skyvern are around here sometimes --
How do you think about code generation versus vision-based approaches to agentic browser use, like OpenAI's Operator, Claude Computer Use, and Magnitude?
From my POV, the vision-based approaches are superior, but they're less amenable to codegen.
I’ve been working hard on our new component implementation (Vue/TS) to include accessibility for components that aren’t just native reskins, like combo and list boxes, and keyboard interactivity is a real pain. One of my engineers had it half-working on her dropdown and threw in the towel for MVP because there are a lot of little state edge cases to watch out for.
Thankfully the spec as provided by MDN for minimal functionality is well spelled out and our company values meeting accessibility requirements, so we will revisit and flesh out what we’re missing.
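For a flavour of what "minimal functionality" means here, this is roughly the shape of the keydown handling the listbox pattern asks for. It's a plain-TS sketch with made-up state names, not our actual component; real ones also need typeahead, aria-activedescendant updates, and scroll-into-view handling.

```ts
// Minimal sketch of listbox keyboard handling per the ARIA listbox pattern.
// ListboxState and its fields are hypothetical names for illustration only.
interface ListboxState {
  options: string[];
  activeIndex: number;   // index of the currently highlighted option
  selectedIndex: number; // index of the committed selection
  open: boolean;
}

function onListboxKeydown(state: ListboxState, event: KeyboardEvent): void {
  const last = state.options.length - 1;
  switch (event.key) {
    case "ArrowDown": // move highlight down, clamped to the last option
      state.activeIndex = Math.min(state.activeIndex + 1, last);
      break;
    case "ArrowUp": // move highlight up, clamped to the first option
      state.activeIndex = Math.max(state.activeIndex - 1, 0);
      break;
    case "Home":
      state.activeIndex = 0;
      break;
    case "End":
      state.activeIndex = last;
      break;
    case "Enter":
    case " ": // commit the highlighted option and close
      state.selectedIndex = state.activeIndex;
      state.open = false;
      break;
    case "Escape": // close without changing the selection
      state.open = false;
      break;
    default:
      return; // let unhandled keys (e.g. Tab) fall through
  }
  event.preventDefault();
}
```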
Also I wanna give props (ha) to the Storybook team for bringing accessibility testing into their ecosystem, as it really does help to have something checking our implementations.
If you're looking to test an LLM's ability to solve a coding task without prior knowledge of the task at hand, I don't think their benchmark is super useful.
If you care about understanding relative performance between models for solving known problems and producing correct output format, it's pretty useful.
- Even for well-known problems, we see a large distribution of quality between models (5% to 75% correctness)
- Additionally, we see a large distribution in models' ability to produce responses in the formats they were instructed to use
At the end of the day, benchmarks are pretty fuzzy, but I always welcome a formalized benchmark as a means to understand model performance over vibe checking.
The devcontainers extension was about a year out of date until the last month or something? Sorry, this is from memory, but it's definitely not 100% compatibility.
I haven't followed this closely, but I assumed it was related to a foreign entity having the ability to hyper-target content towards said 17-year-olds (and the entire userbase in general) -- a modern form of psychological warfare.
The latest research I’ve pulled suggests that DEXA scans are fairly inaccurate and aren’t a reliable way to measure body composition even for the same person across time.
MRI is the gold standard; everything else is pretty loosey-goosey.
Sorry, no references, but this comes up pretty often in the science-based lifting communities on Reddit and YouTube if you want to learn more.
Estimates of the level of inaccuracy on the high end range from ~5% to ~10%.
If you see your lean mass going up on DEXA, your muscles are getting larger, and you're getting stronger (particularly across a wide variety of exercises, where CNS adaptation can't explain the strength gains), then the scans are likely broadly accurate.
Mine have all tracked quite closely with what I see in the mirror and with the amount of weight I'm moving.
I work in this space and Claude's ability to count pixels and interact with a screen using precise coordinates seems like a genuinely useful innovation that I expect will improve upon existing approaches.
Existing approaches tend to involve drawing marked bounding boxes around interactive elements and then asking the LLM to provide a tool call like `click('A12')`, where A12 remaps to the underlying HTML element and we perform some sort of Selenium/JS action. Using heuristics to draw those bounding boxes is tricky, and even performing the correct action can be tricky, since the click handlers might be attached to a different DOM element.
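Roughly the shape of that remapping flow, as a Playwright/TS sketch (the selector heuristic, labels, and parsing here are toy placeholders, not Skyvern's or anyone's actual implementation):

```ts
// "Label + remap" style: tag likely-interactive elements, hand the labels to
// the LLM, then resolve a tool call like click('A12') back to the element.
import { chromium, type Locator } from "playwright";

async function labelRemapDemo(modelToolCall: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com"); // placeholder URL

  // Heuristic selector for "interactive" elements; real systems need far more.
  const candidates = await page
    .locator("a, button, input, select, textarea, [role=button], [onclick]")
    .all();

  const labelToElement = new Map<string, Locator>();
  candidates.forEach((el, i) => labelToElement.set(`A${i}`, el));
  // ...render the labels into the screenshot / DOM snapshot sent to the LLM...

  // Naive parse of a tool call like click('A12') coming back from the model.
  const match = modelToolCall.match(/^click\('(\w+)'\)$/);
  if (match) {
    const target = labelToElement.get(match[1]);
    // The visible element may not own the click handler, hence the flakiness.
    await target?.click();
  }
  await browser.close();
}
```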
Avoiding this remapping between a visual element and an HTML element, and instead working with high-level operations like `click(x, y)` or `type("foo")` directly on the screen, will probably be more effective at automating use cases.
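Whereas the coordinate-based path is basically just this (again a toy sketch; the big assumption is that the model's x/y actually line up with the screenshot resolution):

```ts
// Coordinate-based alternative: no element remapping, just act on the pixels.
import { chromium } from "playwright";

async function coordinateDemo(x: number, y: number, text: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com"); // placeholder URL

  await page.mouse.click(x, y);   // e.g. click(742, 318) from the model
  await page.keyboard.type(text); // e.g. type("foo")

  await browser.close();
}
```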
That being said, providing HTML to the LLM as context does tend to improve performance on top of just visual inference right now.
So I dunno... I'm more optimistic about Claude's approach and am very excited about it... especially if visual inference continues to improve.
Agreed. In the short term (X months) I expect HTML distillation + giving text to LLMs to win out, but in the long term (Y years) screenshot-only + pixels will definitely be the more "scalable" approach.
One very subtle advantage of doing HTML analysis is that you can cut out a decent number of LLM calls by doing static analysis of the page.
For example, you don't need to click on a dropdown to understand the options behind it, or scroll down on a page to find a button to click.
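e.g. for a native `<select>`, the options are already sitting in the DOM, so something like this reads them without any interaction or extra LLM call (toy Playwright/TS sketch; custom JS dropdowns are the hard case, since their options may only render after a click):

```ts
// Static-analysis point: read a native <select>'s options straight from the
// DOM, no clicking and no LLM call needed.
import { chromium } from "playwright";

async function readDropdownOptions(url: string, selectSelector: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // All option labels, with no interaction on the page.
  const options = await page
    .locator(`${selectSelector} option`)
    .allTextContents();

  await browser.close();
  return options;
}
```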
Certainly, as LLMs get cheaper, the extra LLM calls will matter less (similar to what we're seeing with solar panels, where the cost of the panel < the cost of labour now, but it was the reverse in the preceding decade).
> Claude's ability to count pixels and interact with a screen using precise coordinates
I guess you mean its "Computer use" API that can (if I understand correctly) send mouse clicks at specific coordinates?
I got excited thinking Claude can finally do accurate object detection, but alas no. Here's its output:
> Looking at the image directly, the SPACE key appears near the bottom left of the keyboard interface, but I cannot determine its exact pixel coordinates just by looking at the image. I can see it's positioned below the letter grid and appears wider than the regular letter keys, but I apologize - I cannot reliably extract specific pixel coordinates from just viewing the screenshot.
This is 3.5 Sonnet (their most current model).
And they explicitly call out spatial reasoning as a limitation:
> Claude’s spatial reasoning abilities are limited. It may struggle with tasks requiring precise localization or layouts, like reading an analog clock face or describing exact positions of chess pieces.
Since 2022 I've occasionally dipped in and tested this use case with the latest models, but haven't seen much progress on spatial reasoning. The multi-modality has been a neat addition, though.
They report that they trained the model to count pixels, and based on the accurate mouse clicks coming out of it, that seems to be the case for at least some code paths.
> When a developer tasks Claude with using a piece of computer software and gives it the necessary access, Claude looks at screenshots of what’s visible to the user, then counts how many pixels vertically or horizontally it needs to move a cursor in order to click in the correct place. Training Claude to count pixels accurately was critical.
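On the client side, that loop boils down to executing action payloads like `{ action: "left_click", coordinate: [x, y] }` against a real browser or desktop. A rough TS sketch of that dispatch step (the action and field names here are from my memory of the beta docs, so treat them as assumptions, and I'm driving a Playwright page where the reference setup drives a desktop):

```ts
// Hedged sketch: translate a computer-use tool_use input into concrete actions.
// The ComputerAction shape below is my approximation, not the official types.
import type { Page } from "playwright";

interface ComputerAction {
  action: "left_click" | "mouse_move" | "type" | "screenshot";
  coordinate?: [number, number];
  text?: string;
}

async function dispatch(page: Page, a: ComputerAction): Promise<void> {
  switch (a.action) {
    case "left_click":
      if (a.coordinate) await page.mouse.click(a.coordinate[0], a.coordinate[1]);
      break;
    case "mouse_move":
      if (a.coordinate) await page.mouse.move(a.coordinate[0], a.coordinate[1]);
      break;
    case "type":
      if (a.text) await page.keyboard.type(a.text);
      break;
    case "screenshot":
      // In the real loop this image goes back to the model as a tool_result.
      await page.screenshot({ path: "latest.png" });
      break;
  }
}
```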
Keep an eye out for the "Jump to Recipe" button found on most sites.
(It's still terrible UX IMO, but figured I'd share the above tip as it's saved me some frustration)
Sorry about all the broken plastic on the trim -- That's also very familiar...