javascript - Node.js scraping brings back weird results with thepiratebay
I am building a simple Node.js server for my web scraping needs. The thing is, when I try to load The Pirate Bay, the result looks like this:
��[{s�6�;��nz��%y�����g����b��n����"h�����o�$r-{s�nj������<~u������yb����q09���&v�/�w<##��'���q}��t *|�?g��g�e��sg��%|m�l>8�9��+t�4� ��u���y�Ł�n}j�tܳ(�en9nh0c����\�������8��� �@q]��n��.�c���^dmyhg�4Ó�(��p 脱�o�r����8�0]|�j����k���m�_�_ߜ�y:��������|=��|u īz�7:f�@���wݪz|la2���p��
ȋ�����Н��y= �%k�^t��*�;\���6��uď��_���l��r�� ��{��m�!vt豀�t��ۄ���hm��j���|��/a;�v}#��w�z����lc_�hmȎ�!3���䠾�i����usp�)�������j_n=�l����%x�Ā ��������>����-= [pjc�v�v�ز]�x݅Ǎ0�*o��*|<"��+!8�_>%
a�g�i�e/ �s�ҝ
But longer. I tried setting the meta charset to UTF-8, but that didn't work. Here is the main part of app.js:
app.get('/:key/:url', function(req, res) {
  // prevent a bunch of people from overloading the server
  var key = req.params.key;
  if (key != '12345')
    res.send('error: incorrect key');
  else {
    // scraping
    // slashes confuse the system
    var url = ('http://' + req.params.url).replace('#', '/');
    // res.send('successful');
    request(url, function(err, response, html) {
      // if no error occurred
      if (!err)
        res.render('index', { output: html });
      else
        res.send('error loading website');
    });
  }
});
There are no errors on the command line. Any help is appreciated.
It looks like an encoding problem.
My suggestion is to first load the site in a browser (if you can't, use a proxy add-on; there are plenty), then investigate inside the browser to identify the encoding used on the site. It is probably UTF-8, but it is better to verify.
Then the question is why the decoding in JavaScript did not work, or rather, whether it is the decoding that failed or something else.
On your server, you can use a simple shell tool:
wget url
to see whether the issue is in the Node.js code or in the network communication between your server and the site.
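The encoding declared by the site can also be checked from Node itself by looking at the response headers. The helper below is a minimal sketch: `summarizeHeaders` is a hypothetical name, not part of the original code. It pulls the charset out of Content-Type and also reports Content-Encoding, on the assumption that a compressed body that was never decompressed would look just as unreadable as a wrongly decoded one.

```javascript
// Hypothetical helper: summarize the headers that affect how a body must be decoded.
// `headers` is a plain object with lowercase keys, as Node's http module provides.
function summarizeHeaders(headers) {
  const contentType = headers['content-type'] || '';
  // Look for "charset=..." in e.g. "text/html; charset=UTF-8"
  const match = /charset\s*=\s*"?([^\s;"]+)/i.exec(contentType);
  return {
    charset: match ? match[1].toLowerCase() : null,
    // e.g. "gzip" or "deflate"; null when the header is absent
    compression: headers['content-encoding'] || null
  };
}

console.log(summarizeHeaders({
  'content-type': 'text/html; charset=UTF-8',
  'content-encoding': 'gzip'
}));
```

Inside the request callback this would be called as summarizeHeaders(response.headers).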
Here is an example of handling the encoding of such sites with the iconv module:
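var request = require('request');
var charset = require('charset');
var jschardet = require('jschardet');
var Iconv = require('iconv').Iconv;

// fetch the body as a binary string so it can be re-decoded afterwards
request({url: url, encoding: 'binary'}, function(error, response, html) {
  if (error) return console.error(error);
  // try the headers and meta tags first, then fall back to detection
  var enc = charset(response.headers, html);
  if (!enc) {
    enc = jschardet.detect(html).encoding.toLowerCase();
  }
  if (enc != 'utf-8') {
    var iconv = new Iconv(enc, 'UTF-8//TRANSLIT//IGNORE');
    html = iconv.convert(Buffer.from(html, 'binary')).toString('utf-8');
  }
  console.log(html); // debugging
});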
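For comparison, recent Node.js versions can do the conversion step without iconv by using the built-in TextDecoder. This is a minimal sketch, not the method shown above: `decodeBody` is a hypothetical helper, it assumes the body was fetched as a raw Buffer (with request, by passing `encoding: null`), and it assumes a Node build with full ICU so that labels other than UTF-8 are available.

```javascript
// Hypothetical helper: decode a raw response body using a known charset label.
// Assumes `body` is a Buffer and `charsetLabel` came from headers or detection.
function decodeBody(body, charsetLabel) {
  let decoder;
  try {
    decoder = new TextDecoder(charsetLabel || 'utf-8');
  } catch (e) {
    // Unknown label: fall back to UTF-8 rather than failing outright
    decoder = new TextDecoder('utf-8');
  }
  return decoder.decode(body);
}

// "café" encoded as ISO-8859-1: the final byte 0xE9 is not valid UTF-8 on its own
const latin1Bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]);
console.log(decodeBody(latin1Bytes, 'iso-8859-1'));
```

The fallback to UTF-8 is a design choice for this sketch; depending on the site, failing loudly on an unknown label may be preferable.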
Also, I feel you are trying to make a proxy server? I presume that you, like me, live in the UK and the site is banned by Her Majesty. :) Clever guy.
There are far better ways to build a proxy server; your method will have issues loading supplementary resources (JavaScript, images, CSS).
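To illustrate the resource problem: a page rendered through a route like /:key/:url still contains relative links, which the browser resolves against your server rather than the original site. Below is a rough sketch of rewriting them, assuming a hypothetical helper `absolutizeLinks` and a simple regular expression that only handles double-quoted src and href attributes. A real proxy would need a proper HTML parser and would also have to deal with URLs inside CSS and scripts.

```javascript
// Hypothetical helper: rewrite src/href attribute values to absolute URLs.
// Only handles double-quoted attributes; illustrative, not a complete solution.
function absolutizeLinks(html, baseUrl) {
  return html.replace(/\b(src|href)="([^"]*)"/gi, function (whole, attr, value) {
    try {
      // new URL resolves `value` against `baseUrl`; absolute values pass through
      return attr + '="' + new URL(value, baseUrl).href + '"';
    } catch (e) {
      return whole; // leave anything unparseable untouched
    }
  });
}

console.log(absolutizeLinks('<img src="/static/logo.png">', 'http://example.com/browse/'));
```

Here http://example.com/browse/ is only a placeholder base URL; in the route above it would be the URL that was actually requested.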