
Honest question: what type of CSV-to-SQL code are you writing where correctness doesn't count?

Perhaps my few decades in the industry have been in areas where it is always the details, correctness, and fitness for purpose that make those problems hard, not the work itself.

I do see a use case for throwaway spikes, or as part of a red-green-refactor cycle, etc., but if accuracy and correctness aren't critical, data cleanup is easy even without an LLM.



> I use LLMs for annoying shit that used to take an inordinate amount of time. For example an analyst gives me a CSV, it has some data clean up issues, eventually it needs to become a series of SQL inserts.

The CSV-to-SQL-for-analysts problem is a data integrity problem that is domain-specific, not tool-specific.

Remember that a 'relation' in relational databases is just a table: named columns plus tuples (rows).

A CSV is also just tuples (lines), but a SQL schema typically spreads the data across multiple normalized tables, etc.
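
A minimal sketch of that tuple repacking in Python, assuming a hypothetical people.csv with name and city columns and a target schema normalized into cities and people tables:

  import csv

  # Hypothetical schema: people(name, city_id) references cities(id, name).
  # Repack each CSV line (a tuple) into inserts across the normalized tables.
  # Illustrative only: real code should escape values or use parameterized queries.
  city_ids = {}
  with open("people.csv", newline="") as f:
      for row in csv.DictReader(f):
          name, city = row["name"].strip(), row["city"].strip()
          if city not in city_ids:
              city_ids[city] = len(city_ids) + 1
              print(f"INSERT INTO cities (id, name) VALUES ({city_ids[city]}, '{city}');")
          print(f"INSERT INTO people (name, city_id) VALUES ('{name}', {city_ids[city]});")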

Typically bad data is worse than missing data.

For analysts, missing data can lead to bias and reduced statistical power, but methods exist to handle it.

Bad data, on the other hand, can be misleading, deceptive, and/or harmful. An LLM will, by its very nature, be likely to produce bad data when cleaning.

The risk of using an LLM here is that it doesn't have the context or nuance to deal with that. Data cleaning via sed, grep, tr, and awk, language tooling, or even an ETL pipeline can work.
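
For instance, a deterministic pass can quarantine rows it can't validate rather than guess at them, preserving the missing-beats-bad property. A rough sketch, assuming a hypothetical raw.csv whose date column must be ISO-8601:

  import csv, re

  ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

  # Rows that fail validation are quarantined, not guessed at:
  # missing data beats bad data.
  with open("raw.csv", newline="") as src, \
       open("clean.csv", "w", newline="") as good, \
       open("rejects.csv", "w", newline="") as bad:
      reader = csv.DictReader(src)
      clean = csv.DictWriter(good, fieldnames=reader.fieldnames)
      rejects = csv.DictWriter(bad, fieldnames=reader.fieldnames)
      clean.writeheader()
      rejects.writeheader()
      for row in reader:
          if ISO_DATE.match(row.get("date", "")):
              clean.writerow(row)
          else:
              rejects.writerow(row)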

I promise you that fixing the bad data an LLM introduces will be far worse.

Using it in a red-green-refactor model may help with the above, but you will need to stay actively engaged and dig through what it produces.

Personally, I find that takes more time than just viewing the problem as tuple repacking and using my favorite tools to do so.

Data cleaning is hard, but it is the context-specific details that make it so.


In my experience, the difficulty in this kind of task is reading the docs of a bunch of packages I haven't used in months or years and probably won't use again anytime soon, testing things manually, creating all the little harnesses so iteration doesn't take minutes at a time, and so on.

Sure, someone who does ETL-type work all day, or often enough anyway, would scoff, and it's true an LLM won't really save them time. But for me, who does it once in a blue moon, LLMs are great. It's still on me to determine correctness; I am simply no longer contending with the bootstrap problem of learning new packages and their syntax and common usage.


Similarly for me: my visualisation pipeline changed from "relearn matplotlib and pandas every single time" to "ask for code, fix up details later". The time saving scales with how much of the docs I've forgotten since the last time. I need to do the review and debugging either way, so that's moot.
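
For what it's worth, the boilerplate in question is short but easy to forget between uses; a sketch assuming a hypothetical results.csv with group and value columns:

  import pandas as pd
  import matplotlib.pyplot as plt

  # Load, aggregate, and plot -- the steps are simple, the exact
  # incantations are what fade between uses.
  df = pd.read_csv("results.csv")
  means = df.groupby("group")["value"].mean()
  fig, ax = plt.subplots(figsize=(8, 4))
  means.plot.bar(ax=ax)
  ax.set_xlabel("group")
  ax.set_ylabel("mean value")
  fig.tight_layout()
  fig.savefig("means.png", dpi=150)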


It's not your fault their APIs suck!


There are two schools of thought here: viewing LLMs as machines that replace your thinking, and viewing LLMs as a vast corpus of compressed knowledge to draw upon.



