javascript - Node.js scraping brings back weird results with thepiratebay


I am building a simple Node.js server for my web scraping needs. The thing is, when I try to load The Pirate Bay, the result looks like this:

��[{s�6�;��nz��%y�����g����b��n����"h�����o�$r-{s�nj������<~u������yb����q09���&v�/�w<##��'���q}��t *|�?g��g�e��sg��%|m�l>8�9��+t�4� ��u���y�Ł�n}j�tܳ(�en9nh0c����\�������8��� �@q]��n��.�c���^dmyhg�4Ó�(��p 脱�o�r����8�0]|�j����k���m�_�_ߜ�y:��������|=��|u īz�7:f�@���wݪz|la2���p��ȋ�����Н��y= �%k�^t��*�;\���6��uď��_���l��r�� ��{��m�!vt豀�t��ۄ���hm��j���|��/a;�v}#��w�z����lc_�hmȎ�!3���䠾�i����usp�)�������j_n=�l����%x�Ā ��������>����-= [pjc�v�v�ز]�x݅Ǎ0�*o��*|<"��+!8�_>%a�g�i�e/ �s�ҝ

but longer. I tried setting the meta charset to UTF-8, but that didn't work. Here is the main part of app.js:

app.get('/:key/:url', function(req, res) {
    // prevent a bunch of people from overloading the server
    var key = req.params.key;
    if (key != '12345')
        res.send('error: incorrect key');
    else {
        // scraping
        // slashes confuse the system
        var url = ('http://' + req.params.url).replace('#', '/');
        // res.send('successful');
        request(url, function(err, response, html) {
            // if no error occurred
            if (!err)
                res.render('index', { output: html });
            else
                res.send('error loading website');
        });
    }
});

There are no errors in the command line. Any help is appreciated.

It looks like an encoding problem.

My suggestion is to first load the site in a browser (if you can't, use a proxy add-on; there are plenty) and investigate inside the browser to identify the encoding used on the site. It is probably UTF-8, but it is better to verify.
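The same check can be done programmatically. The sketch below is only an illustration: `declaredCharset` is a hypothetical helper name, not part of any library. It assumes the site declares its encoding either in the `Content-Type` header or in a `<meta>` tag, and it returns `null` when neither is present.

```javascript
// Hypothetical helper: returns the charset a response declares, or null.
// `headers` is a plain object with lower-cased header names (as Node provides),
// `html` is the body as a string.
function declaredCharset(headers, html) {
    var contentType = (headers && headers['content-type']) || '';
    var fromHeader = /charset\s*=\s*"?([^\s;"]+)/i.exec(contentType);
    if (fromHeader) {
        return fromHeader[1].toLowerCase();
    }
    // Covers both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">
    var fromMeta = /<meta[^>]*charset\s*=\s*["']?([^\s"'>\/;]+)/i.exec(html || '');
    return fromMeta ? fromMeta[1].toLowerCase() : null;
}

console.log(declaredCharset({ 'content-type': 'text/html; charset=ISO-8859-1' }, ''));
console.log(declaredCharset({}, '<meta charset="utf-8">'));
```

A `null` result means the page does not declare an encoding, in which case detection from the bytes themselves is the remaining option.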

Then the question is why decoding in JavaScript did not work, or rather whether the decoding really failed or something else is going on.

On your server you can use a simple shell tool such as wget with the URL to see whether the issue is in the Node.js code or in the network communication between your server and the site.
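If a shell is not handy, a comparable check can be sketched in Node itself with only the built-in `http` module, bypassing `request` and the view layer. This is a minimal sketch, not code from the original setup: `inspectResponse` and `looksGzipped` are hypothetical names, and the gzip test is just one guess at what the "something else" could be (a body that starts with the bytes 0x1f 0x8b is compressed, not mis-encoded).

```javascript
var http = require('http');

// A gzip stream always starts with the magic bytes 0x1f 0x8b.
function looksGzipped(buf) {
    return buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b;
}

// Hypothetical helper: fetch `url` with the core http module and report the
// headers and raw body so they can be inspected before any decoding happens.
function inspectResponse(url, callback) {
    http.get(url, function (res) {
        var chunks = [];
        res.on('data', function (chunk) { chunks.push(chunk); });
        res.on('end', function () {
            var body = Buffer.concat(chunks);
            callback(null, {
                status: res.statusCode,
                contentType: res.headers['content-type'],
                contentEncoding: res.headers['content-encoding'],
                gzipped: looksGzipped(body),
                firstBytes: body.slice(0, 16).toString('hex')
            });
        });
    }).on('error', callback);
}
```

Calling `inspectResponse(url, function (err, info) { console.log(err || info); })` prints the status, the declared content type and content encoding, and the first bytes of the body, which is usually enough to tell a charset mismatch from a compressed or otherwise unexpected response.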

Here is an example of handling the encoding of such sites with the iconv module:

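var request = require('request');
var charset = require('charset');
var jschardet = require('jschardet');
var Iconv = require('iconv').Iconv;

// Fetch the body as a 'binary' string so that nothing is decoded yet.
request({url: url, encoding: 'binary'}, function(error, response, html) {
    if (error) {
        return console.error(error);
    }
    // Prefer the charset declared in the headers or the markup.
    var enc = charset(response.headers, html);
    if (!enc) {
        // Otherwise fall back to detection from the content.
        enc = jschardet.detect(html).encoding.toLowerCase();
    }
    if (enc != 'utf-8') {
        // Convert to UTF-8, transliterating or dropping unmappable characters.
        var iconv = new Iconv(enc, 'UTF-8//TRANSLIT//IGNORE');
        html = iconv.convert(Buffer.from(html, 'binary')).toString('utf-8');
    }
    console.log(html); // debugging
});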

Also, I get the feeling you are trying to make a proxy server? I presume that you, like me, live in the UK, where the site is banned by Her Majesty. :) Clever guy.

There are far better ways to build a proxy server; this method will have issues loading supplementary resources (JavaScript, images, CSS).

