Reposted

scrape-it: A Node.js page scraping tool

scrape-it

A Node.js scraper for humans.

Installation

$ npm i --save scrape-it

Example

    const scrapeIt = require("scrape-it");

    // Promise interface
    scrapeIt("http://ionicabizau.net", {
        title: ".header h1"
      , desc: ".header h2"
      , avatar: {
            selector: ".header img"
          , attr: "src"
        }
    }).then(page => {
        console.log(page);
    });

    // Callback interface
    scrapeIt("http://ionicabizau.net", {
        // Fetch the articles
        articles: {
            listItem: ".article"
          , data: {

                // Get the article date and convert it into a Date object
                createdAt: {
                    selector: ".date"
                  , convert: x => new Date(x)
                }

                // Get the title
              , title: "a.article-title"

                // Nested list
              , tags: {
                    listItem: ".tags > span"
                }

                // Get the content
              , content: {
                    selector: ".article-content"
                  , how: "html"
                }
            }
        }

        // Fetch the blog pages
      , pages: {
            listItem: "li.page"
          , name: "pages"
          , data: {
                title: "a"
              , url: {
                    selector: "a"
                  , attr: "href"
                }
            }
        }

        // Fetch some other data from the page
      , title: ".header h1"
      , desc: ".header h2"
      , avatar: {
            selector: ".header img"
          , attr: "src"
        }
    }, (err, page) => {
        console.log(err || page);
    });
    // { articles:
    //    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
    //        title: 'Pi Day, Raspberry Pi and Command Line',
    //        tags: [Object],
    //        content: '<p>Everyone knows (or should know)...a" alt=""></p>\n' },
    //      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
    //        title: 'How I ported Memory Blocks to modern web',
    //        tags: [Object],
    //        content: '<p>Playing computer games is a lot of fun. ...' },
    //      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
    //        title: 'How to convert JSON to Markdown using json2md',
    //        tags: [Object],
    //        content: '<p>I love and ...' } ],
    //   pages:
    //    [ { title: 'Blog', url: '/' },
    //      { title: 'About', url: '/about' },
    //      { title: 'FAQ', url: '/faq' },
    //      { title: 'Training', url: '/training' },
    //      { title: 'Contact', url: '/contact' } ],
    //   title: 'Ionică Bizău',
    //   desc: 'Web Developer,  Linux geek and  Musician',
    //   avatar: '/images/logo.png' }

Documentation

scrapeIt(url, opts, cb)

A scraping module for humans.

Params

  • String|Object url : The page URL or request options.
  • Object opts : The options passed to the scrapeHTML method.
  • Function cb : The callback function.

Return

  • Promise A promise object.
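
The `url` parameter accepts either a string or a request-options object. The sketch below illustrates the object form. The `headers` key is an assumption about what the underlying request layer accepts, and `requestOptions` and `scrapeHeader` are hypothetical helper names. The scraper is passed in as an argument so that the wrapper can be exercised with a stub, without network access.

```javascript
// Hypothetical helper: builds the object form of the `url` argument.
// ASSUMPTION: the request layer honours a `headers` field; check the
// documentation of the installed version before relying on it.
function requestOptions(url, userAgent) {
    return { url, headers: { "User-Agent": userAgent } };
}

// Hypothetical wrapper around scrapeIt(url, opts). `scrapeIt` is injected
// so that a stub can stand in for the real module during tests.
function scrapeHeader(scrapeIt, url) {
    return scrapeIt(requestOptions(url, "example-scraper/1.0"), {
        title: ".header h1",
        desc: ".header h2"
    });
}
```

With the real module this would be called as `scrapeHeader(require("scrape-it"), "http://ionicabizau.net").then(page => console.log(page))`. Adding a `.catch` handler is advisable to deal with failed requests.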

scrapeIt.scrapeHTML($, opts)

Scrapes the data in the provided element.

Params

  • Cheerio $ : The input element.
  • Object opts : An object containing the scraping information. If you want to scrape a list, you have to use the listItem selector:

    • listItem (String): The list item selector.
    • data (Object): The fields to include in the list objects:
      • <fieldName> (Object|String): The selector or an object containing:
        • selector (String): The selector.
        • convert (Function): An optional function to change the value.
        • how (Function|String): A function or function name to access the value.
        • attr (String): If provided, the value will be taken based on the attribute name.
        • trim (Boolean): If false, the value will not be trimmed (default: true).
        • eq (Number): If provided, it will select the nth element.
        • listItem (Object): An object, keeping the recursive schema of the listItem object. This can be used to create nested lists.

    Example:

    {
        articles: {
            listItem: ".article"
          , data: {
                createdAt: {
                    selector: ".date"
                  , convert: x => new Date(x)
                }
              , title: "a.article-title"
              , tags: {
                    listItem: ".tags > span"
                }
              , content: {
                    selector: ".article-content"
                  , how: "html"
                }
            }
        }
    }

    If you want to collect specific data from the page, just use the same schema used for the data field.

    Example:

    {
        title: ".header h1"
      , desc: ".header h2"
      , avatar: {
            selector: ".header img"
          , attr: "src"
        }
    }
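
The examples above cover `selector`, `attr`, `convert` and the string form of `how`. As a rough sketch of the remaining options, the schema below combines `eq`, `trim` and a function-valued `how`. All selectors and field names are hypothetical. Two behaviours are assumptions rather than documented facts: that `eq` is zero-based (as in Cheerio's `.eq()`), and that a `how` function receives the matched Cheerio element.

```javascript
// Hypothetical accessor for the `how` option.
// ASSUMPTION: scrape-it calls it with the matched Cheerio element.
const readTimestamp = $el => $el.attr("datetime") || $el.text();

// Hypothetical converter: "$1,299.00" becomes the number 1299.
const toNumber = text => Number(text.replace(/[^0-9.]/g, ""));

const productSchema = {
    // convert: post-process the extracted string
    price: {
        selector: ".product .price",
        convert: toNumber
    },
    // how + convert: custom accessor, then turn the string into a Date
    updatedAt: {
        selector: ".product time",
        how: readTimestamp,
        convert: x => new Date(x)
    },
    // eq: pick one element out of several matches
    // ASSUMPTION: zero-based, so 1 selects the second paragraph
    secondParagraph: {
        selector: ".description p",
        eq: 1
    },
    // trim: false keeps leading and trailing whitespace
    rawSnippet: {
        selector: "pre.snippet",
        trim: false
    }
};
```

An object like this can be passed as `opts` to `scrapeIt(url, opts)` or to `scrapeIt.scrapeHTML($, opts)`.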

Return

  • Object The scraped data.
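
Because `scrapeHTML` works on an already parsed document, it can also be applied to HTML that did not come from a request, such as a saved file or a test fixture. The following is a minimal sketch, assuming the `cheerio` package is installed alongside `scrape-it` and that the function returned by `cheerio.load` is an acceptable `$` argument. `articleSchema` and `scrapeArticles` are hypothetical names.

```javascript
// Schema taken from the list example above, reduced to two fields.
const articleSchema = {
    articles: {
        listItem: ".article",
        data: {
            title: "a.article-title",
            tags: { listItem: ".tags > span" }
        }
    }
};

// Hypothetical helper: scrapes an HTML string without any HTTP request.
// The packages are required on first use, so both must be installed:
//   npm i --save scrape-it cheerio
function scrapeArticles(html) {
    const cheerio = require("cheerio");
    const scrapeIt = require("scrape-it");
    // ASSUMPTION: the function returned by cheerio.load() is a valid `$`.
    const $ = cheerio.load(html);
    return scrapeIt.scrapeHTML($, articleSchema);
}
```

Called with markup that contains `.article` elements, `scrapeArticles` should return an object with an `articles` array holding one entry per matched element.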

How to contribute

Have an idea? Found a bug? See how to contribute.

Where is this library used?

If you are using this library in one of your projects, add it to this list.

  • ui-studentsearch (by Rakha Kanz Kautsar)—API for majapahit.cs.ui.ac.id/studentsearch

License

MIT © Ionică Bizău

Source: https://github.com/IonicaBizau/scrape-it