Does anybody know how to search for HTML-ENTITIES?

lutusp · on Aug 15, 2014

Are you asking for an explanation of HTML entities, or a listing of HTML entities that browsers recognize, or some third thing I cannot imagine?

Also, now that (new) Web pages are nearly all UTF-8 encoded, HTML entities are no longer used, sometimes even deprecated in favor of Unicode plain-text. The advantages are smaller size, easier editing, and a solution to the question about which browsers support which entities.

> But I think it's even more stupid spending more than 2 minutes trying to find out how to do a search.

So you think it should be someone else's two minutes? But seriously, ask yourself if you should even be considering using entities in 2014. Doesn't your HTML development environment support UTF-8?

ericol · on Aug 15, 2014

Thanks for your reply.

HTML-ENTITIES is an accepted value in the list of encodings supported by PHP's mbstring functions; some of those functions take one encoding parameter, or two when they convert from one encoding to another.

I am trying to find some examples online (or in the php.net site) and explanation of its usage.

> Also, now that (new) Web pages are nearly all UTF-8 encoded, HTML entities are no longer used

I'm totally aware of UTF-8 and encodings; I'm responsible for fixing an incredible amount of encoding related bugs in the site.

But I'm working on a fairly large site, that sends the info to the browser in ISO-8859-1 (Can you feel my pain now??? :P ) and there's no way I can change that (at least not in one go). Legacy code and all that stuff. I cannot go to my boss and tell him "We need to change the whole encoding of the site" when all that he wants is this bug fixed, pronto.

I found a bug where some chars (Hungarian, in this case) are not properly shown in the page: ę >> &#amp;#281;

To add insult to injury, the strings are being truncated in some cases, so you could end up with Ksi&#amp;#2...

Finally, I don't think it should be someone else's 2 minutes: I was just asking if somebody knows the answer to that.

I hope this clarifies the matter, feel free to ask anything else that you think would help (I reckon I didn't put much information up front).

Also, I apologize if some of my sentences are difficult to understand; I'm not a native English speaker.

lutusp · on Aug 16, 2014

> I am trying to find some examples online (or in the php.net site) and explanation of its usage.

There are any number of examples of entities, and lists of them. They are a terrible hassle because not all browsers understand the same ones or treat them the same.

> I'm responsible for fixing an incredible amount of encoding related bugs in the site.

That should be fun. :) There will be times when you won't be able to decide whether you're fixing an error or introducing one.

> I found a bug where some chars (Hungarian, in this case) are not properly shown in the page: ę >> &#amp;#281;

The reason should be obvious -- the original code needed to be preserved unchanged, but a post-processor escaped the ampersand -- and incorrectly as well. I wish there were some fast and easy rules, preferably scriptable, but the examples you show are too varied, as though there was more than one cook in the kitchen (an English idiom).
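
To illustrate how this kind of double escaping can arise, here is a minimal Python sketch (Python purely for illustration; the site in question is PHP). It assumes the stored text already contains a numeric character reference and is then escaped a second time. The sample word and the helper name `unescape_one_layer` are hypothetical, and the sketch reproduces the well-formed `&amp;#281;` variant, not the exact malformed string quoted above.

```python
import html
import re

# Hypothetical sample: text already stored with a numeric character reference.
stored = "Ksi&#281;ga"

# Escaping it a second time turns the leading "&" into "&amp;",
# so the browser shows the reference literally instead of the character.
double_escaped = html.escape(stored)

# Matches "&amp;" only when it is followed by something that looks like
# a character reference (decimal, hex, or named) ending in ";".
EXTRA_LAYER = re.compile(r"&amp;(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def unescape_one_layer(text):
    """Undo one extra layer of ampersand escaping, leaving other text alone."""
    return EXTRA_LAYER.sub(r"&\1;", text)

repaired = unescape_one_layer(double_escaped)
```

A repair like this is only safe if you know the text went through exactly one extra escaping pass; text that legitimately discusses `&amp;` followed by an entity name would be altered too.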

I still think you should simply take out entities wherever you can and use ordinary Unicode characters. That also solves the problem of figuring out what prior editors had in mind -- assuming the resulting spelling is unambiguous. But you can also write regular expressions to solve most of the syntactically correct cases, including:

&(string);

-- and --

&#(number);

The first is obviously more difficult because you have to create an associative array (what Python calls a dictionary) to do the translations. The second case is easier, and I have seen examples where the enclosed number was a normal Unicode code point, or a sequence of two.
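
A rough Python sketch of both cases, assuming the standard library's `name2codepoint` table is an acceptable dictionary for the named form. The function names are made up for illustration, and anything that is not a syntactically complete reference (such as a truncated one) is left untouched.

```python
import re
from html.entities import name2codepoint  # standard table: entity name -> code point

# One pattern for &#(number); in decimal or hex, and for &(string);
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def decode_entity(match):
    """Translate a single matched reference, or return it unchanged."""
    body = match.group(1)
    try:
        if body[:2] in ("#x", "#X"):
            return chr(int(body[2:], 16))
        if body.startswith("#"):
            return chr(int(body[1:]))
        if body in name2codepoint:
            return chr(name2codepoint[body])
    except (ValueError, OverflowError):
        pass  # number outside the Unicode range
    return match.group(0)  # unknown name or bad number: leave as-is

def decode_entities(text):
    """Replace every well-formed character reference in text."""
    return ENTITY.sub(decode_entity, text)
```

In practice Python's own `html.unescape` covers the same ground; the hand-written version is only here to show how little machinery the two patterns need.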

Here is a big list of entities:

http://dev.w3.org/html5/html-author/charref

If you hover over each entry, the equivalent Unicode is given, so it seems multiple forms are embedded in the page. You could scrape the page and create a master list / translation table.

The final problem is that you will need to establish which encoding each page has, and don't mix encodings. From your comments, some pages are UTF-8 and some ISO-8859-1, and those two are obviously incompatible.
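
For pages that must stay ISO-8859-1, one approach is to keep text as Unicode internally and convert only at output time, replacing characters the encoding cannot represent with numeric references. The sketch below is a Python analogue of that idea, not PHP's mbstring `HTML-ENTITIES` conversion; the function name and the `max_chars` parameter are hypothetical. Truncating before encoding is one way to avoid cutting a reference in half.

```python
def to_latin1_page_bytes(text, max_chars=None):
    """Encode text for an ISO-8859-1 page (hypothetical helper).

    Does not escape HTML special characters such as "<" or "&";
    that step is assumed to happen elsewhere.
    """
    # Truncate while the text is still plain Unicode, so a cut can never
    # land in the middle of a character reference.
    if max_chars is not None:
        text = text[:max_chars]
    # Characters that ISO-8859-1 cannot represent become &#NNN; references;
    # everything else is written as a single Latin-1 byte.
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")
```

With a sample word like "Księga", the "ę" comes out as `&#281;` while the other letters stay single bytes, and a four-character truncation still ends on a complete reference.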

> Also, I apologize if some of my sentences are difficult to understand; I'm not a native English speaker.

As usual in cases like this (in my experience), your prose is better than that of many native English speakers.

Sok szerencsét! (Good luck!)

ericol · on Aug 18, 2014

Thanks :)