  74910,74912c187768,187779
  < [Example 1: If you want to use the code conversion facetcodecvt_utf8to output tocouta UTF-8 multibyte sequence
  < corresponding to a wide string, but you don't want to alter the locale forcout, you can write something like:\237 D.27.21954 \251ISO/IECN4950wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
  < std::string mbstring = myconv.to_bytes\050L"Hello\134n"\051;
  ---
  >
  > [Example 1: If you want to use the code conversion facet codecvt_utf8 to output to cout a UTF-8 multibyte sequence
  > corresponding to a wide string, but you don’t want to alter the locale for cout, you can write something like:
  >
  > § D.27.2
  > 1954
  >
  > © ISO/IEC
  > N4950
  >
  > wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
  > std::string mbstring = myconv.to_bytes(L"Hello\n");
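For reference, here is that quoted example fleshed out into a self-contained program (the two key lines are verbatim from the standard text above; the rest is my scaffolding). wstring_convert and codecvt_utf8 sit in Annex D because they have been deprecated since C++17, but they still compile on common toolchains:

  // Self-contained version of the N4950 Annex D example quoted above.
  #include <codecvt>
  #include <iostream>
  #include <locale>
  #include <string>

  int main() {
      std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
      std::string mbstring = myconv.to_bytes(L"Hello\n");
      std::cout << mbstring;  // UTF-8 bytes reach cout without changing its locale
  }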
It's indeed faster, but the output is messier. And it doesn't handle Unicode, in contrast to mutool, which does. (That probably also explains the big speed boost.)


In my experience with parsing PDFs, speed has never been the issue; it has always been a matter of quality.


I tried a small PDF and got a memory error. It's definitely much faster than MuPDF on that file.


“The fastest PDF extractor is the one that crashes at the beginning of the file” or something.


fixed.


Yeah, sorry for the confusion. When I said Unicode, I meant foreign text rather than (just) the unescaped symbols, e.g. Greek. For one random Greek textbook[0], zpdf's output is (extract | head -15):

  01F9020101FC020401F9020301FB02070205020800030209020701FF01F90203020901F9012D020A0201020101FF01FB01FE0208 
  0200012E0219021802160218013202120222 0209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C

  020301FF02000205020101FC020901F90003020001F9020701F9020E020802000205020A 
  01FC028C0213021B022002230221021800030200012E021902180216021201320221021A012E00030209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C 
 
  0200020D02030208020901F90203020901FF0203020502080003012B020001F9012B020001F901FA0205020A01FD01FE0208 
  020201300132012E012F021A012F0210021B013202200221012E0222 0209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C 
This holds for the entire book. Mutool extracts the text just fine.

[0]: https://repository.kallipos.gr/handle/11419/15087
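Those runs look like the raw two-byte character codes from the content stream dumped as hex, i.e. never mapped through the font's /ToUnicode CMap. As a rough sketch of the missing step (my own toy code, not zpdf's or mutool's), a parser for the simple bfchar entries of a ToUnicode stream might look like:

  // Toy illustration: map font character codes to Unicode via the
  // "bfchar" sections of a /ToUnicode CMap, i.e. entries like
  //   beginbfchar
  //   <01F9> <0391>
  //   endbfchar
  // (code 0x01F9 -> U+0391; the pairing here is hypothetical)
  #include <cstdint>
  #include <map>
  #include <sstream>
  #include <string>

  std::map<uint16_t, char32_t> parse_bfchar(const std::string& cmap) {
      std::map<uint16_t, char32_t> table;
      std::istringstream in(cmap);
      std::string tok;
      bool active = false;
      auto hex = [](const std::string& t) {
          return std::stoul(t.substr(1, t.size() - 2), nullptr, 16);  // strip <...>
      };
      while (in >> tok) {
          if (tok == "beginbfchar") { active = true; continue; }
          if (tok == "endbfchar")   { active = false; continue; }
          if (active && tok.front() == '<') {
              std::string dst;
              if (in >> dst && dst.front() == '<')
                  table[static_cast<uint16_t>(hex(tok))] =
                      static_cast<char32_t>(hex(dst));
          }
      }
      return table;
  }

Real CMaps also carry codespace ranges, bfrange sections, and multi-code-point destinations, so this only illustrates the shape of the problem.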


works now!

ΑΛΕΞΑΝΔΡΟΣ ΤΡΙΑΝΤΑΦΥΛΛΙΔΗΣ Καθηγητής Τμήματος Βιολογίας, ΑΠΘ

     ΝΙΚΟΛΕΤΑ ΚΑΡΑΪΣΚΟΥ
     Επίκουρη Καθηγήτρια Τμήματος Βιολογίας, ΑΠΘ

     ΚΩΝΣΤΑΝΤΙΝΟΣ ΓΚΑΓΚΑΒΟΥΖΗΣ
     Μεταδιδάκτορας Τμήματος Βιολογίας, ΑΠΘ





     Γονιδιώματα
     Δομή, Λειτουργία και Εφαρμογές


Nice! Speed wasn't even compromised; still 5x in my benchmarks. Also just saw there's a page with the tool compiled to WASM. Cool.


thanks! :)


sorry, I haven't yet figured out non-Latin text with ToUnicode references.
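For Greek running text like the book above, the codes are more likely covered by bfrange entries, which map whole runs of consecutive codes at once. A hypothetical entry and its expansion (illustrative values, not any real tool's code):

  #include <cstdint>
  #include <map>

  // Expand one ToUnicode bfrange entry such as
  //   <01F9> <0212> <0391>
  // (codes 0x01F9..0x0212 map to consecutive code points starting at U+0391;
  // the values are hypothetical). Entries whose destination is an explicit
  // array of strings are a separate case this sketch ignores.
  void expand_bfrange(std::map<uint16_t, char32_t>& table,
                      uint16_t lo, uint16_t hi, char32_t dst_first) {
      for (uint32_t code = lo; code <= hi; ++code)
          table[static_cast<uint16_t>(code)] =
              static_cast<char32_t>(dst_first + (code - lo));
  }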


Lol, but there are 100 competitors in the PDF text extraction space, and some are multi-million-dollar industries: AWS Textract, ABBYY FineReader, PDFBox. I think you may be underestimating the challenge here.



