Wednesday, September 16, 2009

Character encoding in Web Development

If you have a static HTML page, use the following tag in the <head> section:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If you have a dynamic page (JSP or Servlet), use the following code before obtaining the response writer:
response.setCharacterEncoding("UTF-8");

If this method is called, the web server sets the Content-Type HTTP header accordingly:
Content-Type: text/html; charset=UTF-8

Assume you have a page which accepts Russian characters.
Then, according to the standards, you need to specify accept-charset="UTF-8" in the form tag:
<form accept-charset="UTF-8" >

However, Internet Explorer does not support accept-charset. It is still better to specify this attribute, but it is not sufficient on its own.
Most browsers encode submitted form data with the same encoding they used to render the page (the one specified by the Content-Type HTTP header). Hence, specify the encoding for the page which contains the form, as described above. This makes the browser use the same encoding for form submission.
Browsers are supposed to state the encoding in the Content-Type HTTP header sent along with the HTTP request for the form submission. However, most browsers, including IE and Firefox, send that header without a charset parameter. Hence, on the server side there is no way to ascertain the encoding used by the client.
The best workaround is to set Content-Type (as mentioned above) on the web pages containing forms, and so force the browser to use a specific encoding scheme. Then specify the encoding on the server, before reading any request parameters, by calling
if(request.getCharacterEncoding() == null) request.setCharacterEncoding("UTF-8");
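
To see why the server has to decode with the same encoding the browser used, here is a minimal sketch that uses only the standard java.net.URLEncoder and java.net.URLDecoder classes instead of a real browser and servlet container. The class name FormEncodingDemo and its two helper methods are made up for this illustration. It assumes the browser percent-encodes the form value as UTF-8, and compares a server that decodes as UTF-8 with one that wrongly assumes ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, not part of the servlet API.
class FormEncodingDemo {

    // Percent-encodes a form value roughly the way a browser would for a
    // page rendered in the given charset.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Decodes a submitted value the way a server would if it assumed
    // the given charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // A six-letter Russian word, written with Unicode escapes so the
        // source file itself stays plain ASCII.
        String original = "\u041F\u0440\u0438\u0432\u0435\u0442";

        String sent = encode(original, "UTF-8");
        System.out.println(sent); // %D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82

        // Server and browser agree on UTF-8: the text survives.
        System.out.println(decode(sent, "UTF-8").equals(original)); // true

        // Server assumes ISO-8859-1: each of the 12 bytes becomes its own
        // character, so the six letters turn into 12 garbage characters.
        System.out.println(decode(sent, "ISO-8859-1").length()); // 12
    }
}
```

The request itself looks identical in both cases, which is why the server cannot detect the mismatch and has to be told the encoding explicitly.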

However, there is another catch. If you want to submit a form with some other encoding scheme (for whatever reason; typically this happens when you want to submit a form to another website), it will be difficult from IE.
On the whole, the best practice is to always use
response.setCharacterEncoding("UTF-8"); -- in JSP/Servlet
if(request.getCharacterEncoding() == null) request.setCharacterEncoding("UTF-8"); -- in JSP/Servlet
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> -- in static pages
<form accept-charset="UTF-8" > -- all forms
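
The request-side rule in the list above falls back to UTF-8 only when the request does not name a charset itself. As a rough sketch of that fallback outside a servlet container, the hypothetical helper below (RequestCharset.resolve is an invented name, not a servlet API method) reads the charset parameter from a Content-Type header value and returns UTF-8 when none is present. A real container does its own header parsing, so treat this only as an illustration of the logic.

```java
import java.util.Locale;

// Hypothetical helper, not part of the servlet API.
class RequestCharset {

    static final String DEFAULT_CHARSET = "UTF-8";

    // Returns the charset named in a Content-Type header value, or UTF-8
    // when the header is missing or carries no charset parameter.
    static String resolve(String contentType) {
        if (contentType == null) {
            return DEFAULT_CHARSET;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding quotes, as in charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) {
                    return value;
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        // Typical browser form post: no charset parameter, so fall back.
        System.out.println(resolve("application/x-www-form-urlencoded")); // UTF-8

        // A client that does declare its charset is left alone.
        System.out.println(resolve("application/x-www-form-urlencoded; charset=windows-1251")); // windows-1251
    }
}
```

Keeping the check conditional means a client that does declare its encoding is still decoded as it asked, while the common undeclared case is treated as UTF-8.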

Always use "UTF-8" as the encoding. It can represent every Unicode character, so a single encoding covers Russian and any other language your pages may need.
