echaya2022 / jsoup

forked from OpenHarmony-SIG / jsoup 

jsoup

Introduction

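  • Fetch and parse HTML from a URL, an HTML string, a file stream, a file path, or a rawfile path;
  • Manipulate HTML elements, attributes, and text;
  • Sanitize HTML so that it can be trusted.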

preview.gif

Installation

npm install @ohos/jsoup --save

For more on configuring the OpenHarmony npm environment, see How to install OpenHarmony npm packages.

Usage

  1. Import the dependencies
import { Jsoup, SanitizeHtml, Parser, DomHandler, Document, DomUtils } from '@ohos/jsoup'
  2. Parse HTML
const html = `
<!DOCTYPE html>
<html lang="en">
<head>
   <meta charset="UTF-8">
   <meta http-equiv="X-UA-Compatible" content="IE=edge">
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
   <title>Document</title>
</head>
<style>
   .tagh1{
       background-color: aquamarine;
       color:'blue';
   }
   .one-div{
       line-height: 30px;
   }
</style>
<body>
   <h1 class="tagh1">
       kkkk
       <p>hhhhh</p>
   </h1>
   <div style="color:red; height:100px;" class="one-div">cshi</div>
   <img src="https:baidu.com" alt="wwww"/>
   <p>wjdwekfe>>>>></p>
   <em>dsjfw<<<<<p
   <div>dksfmjk</div>
   owqkdo</em>
</body>
</html>
`

Parsing method 1:

const parser = new Parser.Parser({
  onopentag(name, attributes) {
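    // attributes is an object, so serialize it; plain interpolation would print "[object Object]"
    console.info(`jsoup onopentag name --> ${name}  attributes --> ${JSON.stringify(attributes)}`)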
  },
  ontext(text) {
    console.info("jsoup text -->", text);
  },
  onopentagname(name) {
    console.info("jsoup tagName -->", name);
  },
  onattribute(name, value) {
    console.info(`jsoup attribName name --> ${name}  value --> ${value}`)
  },
  onclosetag(tagname) {
    console.info("jsoup closeTag --> ", tagname);
  },
});
parser.write(html);
parser.end();
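
The callbacks above only log each event, but a handler often needs to accumulate state across events. The following is a minimal sketch of that pattern and is not part of the library: `TextCallbacks` and `createTextCollector` are hypothetical names, the callback signatures are copied from the `Handler` interface documented in this README, and the events are fed in by hand so that the example runs without `@ohos/jsoup`.

```typescript
// Subset of the Handler callbacks listed in this README, declared locally
// (an assumption) so the sketch compiles without the library.
interface TextCallbacks {
  onopentag(name: string, attribs: { [s: string]: string }): void;
  ontext(data: string): void;
  onclosetag(name: string): void;
}

// Hypothetical helper: collects the trimmed text found inside every <tagName> element.
function createTextCollector(tagName: string): { callbacks: TextCallbacks; results: string[] } {
  const results: string[] = [];
  let depth = 0; // greater than 0 while we are inside a matching element
  let buffer = "";

  const callbacks: TextCallbacks = {
    onopentag(name) {
      if (name === tagName) {
        if (depth === 0) buffer = ""; // start a fresh buffer for the outermost match
        depth++;
      }
    },
    ontext(data) {
      if (depth > 0) buffer += data;
    },
    onclosetag(name) {
      if (name === tagName && depth > 0) {
        depth--;
        if (depth === 0) results.push(buffer.trim());
      }
    },
  };
  return { callbacks, results };
}

// Simulate the events a parser would emit for:
// <div><p>hello</p> skip <p> world </p></div>
const collector = createTextCollector("p");
const cb = collector.callbacks;
cb.onopentag("div", {});
cb.onopentag("p", {});
cb.ontext("hello");
cb.onclosetag("p");
cb.ontext(" skip ");
cb.onopentag("p", {});
cb.ontext(" world ");
cb.onclosetag("p");
cb.onclosetag("div");
console.log(collector.results); // the two collected strings: "hello" and "world"
```

With the real library, the same `collector.callbacks` object would presumably be passed as the first argument of `new Parser.Parser(...)`, and the results read after `parser.end()`.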



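Alternatively, pass a DomHandler to build a DOM tree:

const handler = new DomHandler((error, dom) => {
  if (error) {
    // Handle the parse error
  } else {
    // Parsing completed; `dom` holds the parsed nodes
  }
});
// Named domParser so it does not clash with the parser declared above
const domParser = new Parser.Parser(handler, { decodeEntities: true });
domParser.write(html);
domParser.end();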

Parsing method 2:

let dom: Document = Parser.parseDocument(html)
  3. Obtain the HTML text
  • Method 1: fetch the HTML text from a URL
// Requires: import http from '@ohos.net.http'
let httpRequest = http.createHttp()
httpRequest.request('http://106.15.92.248/share/html.txt')
  .then((data) => {
    console.log("jsoup url html=" + JSON.stringify(data))
    if (data.result && typeof data.result === 'string') {
      parser.write(data.result);
      parser.end();
    }
  })
  .catch((err) => {
    console.error('jsoup connect error:' + JSON.stringify(err));
  })
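  • Method 2: read the HTML text from a file stream
// `stream` is a fileio.Stream opened on the HTML file (not shown here);
// per the API reference below, the call returns the HTML as a string
const htmlText: string = Jsoup.parseHtmlFromFile(stream, html.length)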
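  • Method 3: read the HTML text from a rawfile
// Requires: import util from '@ohos.util'
// Note: assign this variable in MainAbility first: globalThis.Context = this.context;
if (!globalThis.Context) {
  console.log('jsoup global Context is undefined');
  return;
}
const filePath = globalThis.Context.filesDir + '/testHtml.html';
// NOTE: getRawFile expects a path relative to the rawfile directory;
// adjust the argument to match where your file actually lives.
globalThis.Context.resourceManager.getRawFile(filePath)
  .then((data) => {
    // Decode the raw bytes as UTF-8 text
    const textDecoder = new util.TextDecoder("utf-8", {
      ignoreBOM: true
    })
    const result: string = textDecoder.decode(data, {
      stream: false
    })
    console.log("jsoup getHtmlFromRawFile text=" + result);
    // createFile and writeFile are helper methods that are not shown here
    this.createFile(filePath);
    this.writeFile(filePath, result);
  })
  .catch((err) => {
    console.log("jsoup getHtmlFromRawFile err=" + err)
  })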
  • Method 4: read the HTML text from a file path
 // Requires: import fileio from '@ohos.fileio'
 if (!globalThis.Context) {
   console.log('jsoup global Context is undefined');
   return;
 }
 var filePath = globalThis.Context.filesDir + '/testHtml.html';
 fileio.readText(filePath)
   .then((data) => {
     console.log("jsoup getHtmlFromFilePath text=" + data);
     parser.write(data);
     parser.end();
   })
   .catch((err) => {
     console.log("jsoup getHtmlFromFilePath err=" + err)
   })
  4. Extract HTML attributes
// Extract the CSS
Jsoup.parseCSS(html)

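Extract information from a parsed DOM object:

// Get elements by tag name
let element = DomUtils.getElementsByTagName('style', dom)
// Get the text content
let text = DomUtils.getText(element)
// Check whether the node is a tag
let isTag = DomUtils.isTag(element[0])
// Check whether the node is CDATA
let isCDATA = DomUtils.isCDATA(element[0])
// Check whether the node is text
let isText = DomUtils.isText(element[0])
// Check whether the node is a comment
let isComment = DomUtils.isComment(element[0])
// Get the children of a given element (here, the first <body> element)
let body = DomUtils.getElementsByTagName('body', dom)
let children = DomUtils.getChildren(body[0])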
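
To make the lookups above more concrete, here is a minimal sketch of how a tag-name search and a text extraction can be implemented as recursive tree walks. It is an illustration only, not the library's implementation: `SketchNode`, `findByTagName`, and `textOf` are hypothetical names, and the node shape is an assumed simplification of a real DOM node.

```typescript
// Minimal stand-in node types (assumptions; real DOM nodes carry more fields).
interface TextNode { type: "text"; data: string; }
interface ElementNode { type: "tag"; name: string; children: SketchNode[]; }
type SketchNode = TextNode | ElementNode;

// Type guard, similar in spirit to an isTag check.
function isTagNode(node: SketchNode): node is ElementNode {
  return node.type === "tag";
}

// Depth-first search for every element with the given tag name.
function findByTagName(name: string, nodes: SketchNode[]): ElementNode[] {
  const found: ElementNode[] = [];
  for (const node of nodes) {
    if (!isTagNode(node)) continue;
    if (node.name === name) found.push(node);
    found.push(...findByTagName(name, node.children));
  }
  return found;
}

// Concatenate the text of all descendant text nodes, in document order.
function textOf(nodes: SketchNode[]): string {
  return nodes.map((n) => (isTagNode(n) ? textOf(n.children) : n.data)).join("");
}

// A hand-built tree mirroring part of the sample HTML in this README.
const tree: SketchNode[] = [
  { type: "tag", name: "body", children: [
    { type: "tag", name: "h1", children: [
      { type: "text", data: "kkkk" },
      { type: "tag", name: "p", children: [{ type: "text", data: "hhhhh" }] },
    ] },
    { type: "tag", name: "div", children: [{ type: "text", data: "cshi" }] },
  ] },
];

const paragraphs = findByTagName("p", tree);
const headingText = textOf(findByTagName("h1", tree));
console.log(paragraphs.length); // 1
console.log(headingText);       // kkkkhhhhh
```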
  5. Sanitize HTML
const clean = SanitizeHtml('before <img src="test.png" /> after', {
    disallowedTagsMode: 'escape',
    allowedTags: [],
    allowedAttributes: false
})
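
In the call above, no tags are allowed and disallowedTagsMode is 'escape', so the disallowed markup is meant to be kept as visible text rather than dropped. The sketch below illustrates only the underlying idea of entity-escaping markup characters. `escapeMarkup` is a hypothetical helper, not a function of this library, and the exact string that SanitizeHtml returns may differ.

```typescript
// Hypothetical helper: replace the characters that start or end markup with
// HTML entities, so a browser shows them as text instead of parsing them as tags.
function escapeMarkup(text: string): string {
  return text
    .replace(/&/g, "&amp;") // must run first so existing ampersands are not double-handled later
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const escaped = escapeMarkup('before <img src="test.png" /> after');
console.log(escaped); // before &lt;img src="test.png" /&gt; after
```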

API Reference

  1. Parse an HTML string

    Method 1:

     interface ParserOptions {
       decodeEntities?: boolean;
       lowerCaseTags?: boolean;
       lowerCaseAttributeNames?: boolean;
       recognizeCDATA?: boolean;
       recognizeSelfClosing?: boolean;
     }
    
     interface Handler {
       onparserinit(parser: Parser): void;
       onreset(): void;
       onend(): void;
       onerror(error: Error): void;
       onclosetag(name: string): void;
       onopentagname(name: string): void;
       onattribute(name: string, value: string, quote?: string | undefined | null): void;
       onopentag(name: string, attribs: {
           [s: string]: string;
       }): void;
       ontext(data: string): void;
       oncomment(data: string): void;
       oncdatastart(): void;
       oncdataend(): void;
       oncommentend(): void;
       onprocessinginstruction(name: string, data: string): void;
    }
    const parser = new Parser.Parser(cbs: Partial<Handler> | null, options?: ParserOptions)
    parser.write(html)
    parser.end();

    Method 2:

    parseDocument(data: string, options?: Options): Document
  2. Extract HTML attributes

    For the DomUtils interface definitions, see: Doc

    Jsoup.parseCSS(html: string): string 
  3. Read HTML from a file stream

    Jsoup.parseHtmlFromFile(stream: fileio.Stream, htmlLength: number): string
  4. Sanitize HTML

     SanitizeHtml(dirty: string, options?: sanitize.IOptions): string
     
     Configurable options:
     interface Attributes { [attr: string]: string; }
     interface Tag { tagName: string; attribs: Attributes; text?: string | undefined; }
     type Transformer = (tagName: string, attribs: Attributes) => Tag;
     type AllowedAttribute = string | { name: string; multiple?: boolean | undefined; values: string[] };
     
     allowedAttributes?: Record<string, AllowedAttribute[]> | false;
     allowedStyles?: { [index: string]: { [index: string]: RegExp[] } };
     allowedClasses?: { [index: string]: boolean | Array<string | RegExp> }
     allowedIframeDomains?: string[];
     allowedIframeHostnames?: string[];
     allowIframeRelativeUrls?: boolean;
     allowedSchemes?: string[] | boolean;
     allowedSchemesByTag?: { [index: string]: string[] } | boolean;
     allowedSchemesAppliedToAttributes?: string[];
     allowedScriptDomains?: string[];
     allowedScriptHostnames?: string[];
     allowProtocolRelative?: boolean;
     allowedTags?: string[] | false;
     allowVulnerableTags?: boolean;
     textFilter?: ((text: string, tagName: string) => string);
     exclusiveFilter?: ((frame: IFrame) => boolean);
     nonTextTags?: string[];
     selfClosing?: string[];
     transformTags?: { [tagName: string]: string | Transformer };
     parser?: ParserOptions;
     disallowedTagsMode?: 'discard' | 'escape' | 'recursiveEscape';
     enforceHtmlBoundary?: boolean;
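
As an illustration of how these options combine, the following sketch builds an options object that keeps a few tags and attributes and discards everything else. `SketchOptions` is a local, hypothetical subset of the option types listed above, declared only so the example compiles on its own; the chosen tag and attribute names are arbitrary examples.

```typescript
// Local copies of a few of the option types listed above (an assumption made
// so that this sketch compiles without the library).
type AllowedAttribute = string | { name: string; multiple?: boolean; values: string[] };

interface SketchOptions {
  allowedTags?: string[] | false;
  allowedAttributes?: Record<string, AllowedAttribute[]> | false;
  allowedSchemes?: string[] | boolean;
  disallowedTagsMode?: 'discard' | 'escape' | 'recursiveEscape';
}

const options: SketchOptions = {
  // Only these tags are kept
  allowedTags: ['p', 'em', 'a', 'img'],
  // Attributes are listed per tag name
  allowedAttributes: {
    a: ['href'],
    // The object form restricts an attribute to specific values
    img: ['src', { name: 'alt', values: ['logo', 'icon'] }],
  },
  // Only https URLs are accepted
  allowedSchemes: ['https'],
  // Disallowed tags are dropped instead of being escaped
  disallowedTagsMode: 'discard',
};

console.log(JSON.stringify(options, null, 2));
```

An object of this shape would presumably be passed as the second argument of `SanitizeHtml(dirty, options)`.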

Compatibility

Supports OpenHarmony API version 9 and above.

Directory Structure

|---- jsoup  
|     |---- entry  # sample code
|        |----src
|           |----addTag.ets
|           |----index.ets
|     |---- jsoup  # jsoup library
|        |----src
|         |----main
|             |----ets
|                 |----common  # templates
|                 |----Cleaner.ts  # HTML cleaning
|                 |----Jsoup.ts  # HTML parsing
|           |---- index.ts  # public API
|     |---- README.md  # installation and usage

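Contributing

If you run into any problem while using the library, feel free to open an Issue. Pull requests are also very welcome.

License

This project is licensed under MIT. Feel free to enjoy and take part in open source.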

The MIT License

Copyright (c) 2009-2022 Jonathan Hedley <https://jsoup.org/>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
