How do you extract the content of Office documents in Python, Java, Ruby, Node, and other languages?

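I'm sure many of you have needed to pull the text out of doc, ppt, excel, pdf, or txt files. The extracted text can be used for full-text indexing, for building a short summary of a document, and so on. Searching Baidu for "how to extract document content" does turn up plenty of demos, but many of them either depend on a paid jar or produce incomplete or garbled output. Java has many open-source toolkits for this and even they are not perfect, let alone what is available for other languages such as Node, Go, Ruby, and Python!

If there were a free API to call, the language would no longer matter, which saves both time and effort. With that in mind I searched Baidu, and to my surprise I actually found one. I'm writing it down here in the hope that it helps others too.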

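Detailed interface documentation: https://www.xiaocongjisuan.co…

Interface summary: extracts the content of Word, PPT, PDF, and other documents and returns it to the caller.

The detailed documentation already explains things clearly, so below I'll only go over a few points to note: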

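1. About appKey and openId

appKey: the unique identifier of the interface; find it in the user console under Application Center -> My Interfaces

openId: the platform ID, generated automatically when you register; find it in the user console under User Center -> Account Information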
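One way to keep these two values out of source code is to read them from environment variables. The following is a minimal Python 3 sketch; the variable names `XCJS_APP_KEY` and `XCJS_OPEN_ID` are hypothetical, chosen only for illustration, and are not defined by the platform.

```python
import os


def load_credentials(env=None):
    """Return (app_key, open_id) read from environment variables.

    XCJS_APP_KEY and XCJS_OPEN_ID are hypothetical names used only in
    this sketch; the platform does not define them.
    """
    env = os.environ if env is None else env
    app_key = env.get("XCJS_APP_KEY", "").strip()
    open_id = env.get("XCJS_OPEN_ID", "").strip()
    if not app_key or not open_id:
        raise ValueError("XCJS_APP_KEY and XCJS_OPEN_ID must both be set")
    return app_key, open_id
```

Failing early with a clear message is a design choice of this sketch: a missing credential is easier to diagnose locally than from an error code returned by the interface.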

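2. About passing the file

The file is always passed as a Base64-encoded value. Because GET request parameters are limited in length, you must call the API with POST. [For the document's Base64 value, keep the data complete (the data header is optional). Taking TXT as an example: ["data:text/plain;base64,aHR0cHM6…"] or ["aHR0cHM6…"]]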
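The two accepted forms can be produced with Python 3's standard library alone. This is a minimal sketch, not the platform's own code; the helper names `encode_document` and `strip_data_header` are made up for illustration.

```python
import base64


def encode_document(raw, mime_type=None):
    """Base64-encode document bytes, optionally prefixed with a data header."""
    encoded = base64.b64encode(raw).decode("ascii")
    if mime_type is None:
        return encoded
    return "data:%s;base64,%s" % (mime_type, encoded)


def strip_data_header(value):
    """Return only the Base64 part of a value that may carry a data header."""
    if value.startswith("data:") and "," in value:
        return value.split(",", 1)[1]
    return value


if __name__ == "__main__":
    print(encode_document(b"https:"))                # aHR0cHM6
    print(encode_document(b"https:", "text/plain"))  # data:text/plain;base64,aHR0cHM6
    # Base64 output is about 4/3 the size of the input, so even a small
    # document quickly outgrows what fits in a GET query string.
    print(len(encode_document(b"\x00" * 30000)))     # 40000
```

The last line shows why POST is required: 30,000 bytes of input become 40,000 characters of Base64 before any URL escaping is applied.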

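3. Error codes

For the error codes returned by the interface, see the detailed interface documentation.

The interface documentation also includes call demos in several languages, such as Java, Python, PHP, C#, Go, and Node.js. It isn't limited to those, though: any language that can send a POST request can use this interface, which is very convenient. The platform also offers many other interfaces, most of them free, such as weather forecasts, a perpetual calendar, the traditional Chinese almanac, Chinese word segmentation, movie data lookup, e-book lookup, and cloud-drive data; explore them at your leisure! The code for document content extraction follows.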
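To illustrate that nothing beyond a form-encoded POST is needed, here is a minimal Python 3 sketch that assembles the request with the standard library. The URL and the four parameter names come from the demos in this article; the function name `build_request` and the explicit `Content-Type` header are assumptions of this sketch. It builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint as used by the demos in this article.
API_URL = "http://api.xiaocongjisuan.com/develop/doccontent/analysis"


def build_request(app_key, open_id, doc_name, base64_doc):
    """Build (but do not send) the form-encoded POST request."""
    form = {
        "appKey": app_key,
        "openId": open_id,
        "docName": doc_name,
        "base64Doc": base64_doc,
    }
    # urlencode percent-escapes '+', '/' and '=', which all occur in Base64.
    body = urlencode(form).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(API_URL, data=body, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen(request, timeout=30)` sends it; the 30-second timeout mirrors the Java demo's setting and is otherwise arbitrary.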

Java version:

package com.xiaocongjisuan.module.example;

import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

public class Application {
    
     public static final String DEF_CHATSET = "UTF-8";
     public static final int DEF_CONN_TIMEOUT = 30000;
     public static final int DEF_READ_TIMEOUT = 30000;
     public static String userAgent =  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36";
     
     // Set the appKey and openId you applied for
     public static final String APP_KEY ="yours";
     public static final String OPEN_ID ="yours";
     
     // Convert a map into a URL-encoded query string
     public static String urlEncode(Map<String,Object> params) {
        
        if(params==null){return "";}
         
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String,Object> i : params.entrySet()) {
            try {
                sb.append(i.getKey()).append("=").append(URLEncoder.encode(i.getValue()+"","UTF-8")).append("&");
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }
        String r=sb.toString();
        if(r.endsWith("&")){
            r = r.substring(0,r.length()-1);
        }
        return r;
     }
     
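     /**
     * Sends an HTTP request and returns the response body.
     *
     * @param requestUrl request URL
     * @param params request parameters
     * @param method request method ("get" or "post")
     * @return response body, or null if an I/O error occurred
     * @throws Exception if the request cannot be completed
     */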
     public static String requestContent(String requestUrl, Map<String,Object> params,String method) throws Exception {
        
        HttpURLConnection conn = null;
        BufferedReader reader = null;
        String rs = null;
        try {

            // Build the request URL
            StringBuilder sb = new StringBuilder();
            
            if(method!=null&&method.equalsIgnoreCase("get")){
                requestUrl = requestUrl+"?"+urlEncode(params);
            }

            // Defaults to GET
            URL url = new URL(requestUrl);
            conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            
            if(method!=null&&method.equalsIgnoreCase("post")){
                 conn.setRequestMethod("POST");
                 conn.setDoOutput(true);
                 conn.setDoInput(true);
            }

            // Connection settings
            conn.setRequestProperty("User-agent", userAgent);
            conn.setUseCaches(false);
            conn.setConnectTimeout(DEF_CONN_TIMEOUT);
            conn.setReadTimeout(DEF_READ_TIMEOUT);
            conn.setInstanceFollowRedirects(false);
            conn.connect();
            
            if (params!= null && "post".equalsIgnoreCase(method)) {
                // try-with-resources flushes and closes the stream once the body is written
                try (DataOutputStream out = new DataOutputStream(conn.getOutputStream())) {
                    out.writeBytes(urlEncode(params));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            
            // Read the response
            InputStream is = conn.getInputStream();
            reader = new BufferedReader(new InputStreamReader(is, DEF_CHATSET));
            String strRead = null;
            while ((strRead = reader.readLine()) != null) {
                sb.append(strRead);
            }
            rs = sb.toString();
            
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (reader != null) {
                reader.close();
            }
            if (conn != null) {
                conn.disconnect();
            }
        }
        return rs;
    }
    
    
    public static void main(String[] args) throws Exception{

        String domain="http://api.xiaocongjisuan.com/";
        String servlet="develop/doccontent/analysis";
        String method="post";
        
        String requestUrl=domain+servlet;
        Map<String,Object> params=new HashMap<String,Object>();
        params.put("appKey",APP_KEY);
        params.put("openId",OPEN_ID);

        String base64Doc="UEsDBBQABgAIAAAAIQDd/JU3ZgEAACAFAAATAAgCW0NvbnRlbnRfVHlwZXNdLnhtbCCiBAIootuwjAQvFfqP0S+Vomhh6qqCBz6OLZIpR9g7A1Y9Uv28vr7bgJEVQtBKuUSKVnvzOzsxIPR2ppsCTFp70rWL3osAye90m5Wso/JS37PsoTCKWG8g5JtILHR8PpqMNkESBl1u1SyOWJ44DzJOViRCh/AUaXy0Qqk1zjjQchPMQN+2+vdcekdgsMcaww2HDxBJRYGs+c1fd4qiWASyx63B2uukokQjJYCSSlfOvWDJd8xFNTZnElzHdINyWD8IENdOU6w63sja6JWkI1FxFdhSQZf+ai48nJhaYaiG+aATl9VWkLbX6OF6CWkRJ5bU7QVK7Tb6z+qI+HGQPp/FVvcLnrSOY4+JE57OZsf6s0rUDlZESCihnZ1x0cHRLLsEsPvkLvGb1KAlHfgzbN/tgcNzEnKin6JiZgaOJvvV/Ja6JMiVjB9v5j738C7hLT5kz7+wYz9dVF3H0gdb+634RcAAAD//wMAUEsDBBQABgAIAAAAIQAekRq38wAAAE4CAAALAAgCX3JlbHMvLnJlbHMgogQCKKAAAgjJLbSgNBDIbvBd9hyH032woi0tneSKF3IusDhJnsAXcOzKTavr2jILpQ217m9OfLT9abg5vUO6c8Bq9hWdWg2JtgR99reG23iwdQWchbmoJnDUfOsGlub9YvPJGUoTyMMaui4rOGQSQ+ImYzsKNchci+VLqQHEkJU4+RzBv1jKu6vsf0VwOamabaWQ1pZ+9AtcdYNl/WDl03Gn4KZu/Yy4kVyAdhb9kuYipsScZyjWop9SwabDDPJZ2RYqwKNuBpotX1RP9fi46FLAmhCYnP83x1nANaXg902aJ5x687HyFZLBZ9e/tDg7MvaD4BAAD//wMAUEsDBBQABgAIAAAAIQDWZLNR+gAAADEDAAAcAAgBd29yZC9fcmVscy9kb2N1bWVudC54bWwucmVscyCiBAEooAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAKySzWrDMBCE74W+g9h7LTv9oYTIuZRArq37AIq9/qGyJLSbtn77CkNShwb34otgRmjmk7Sb7XdvxCcG6pxVkCUpCLSlqzrbKHgvdnfPIIi1rbRxFhUMSLDNb282r2g0x0PUdp5ETLGkoGX2aympbLHXlDiPNu7ULvSaowyN9Lr80A3KVZo+yTDNgPwiU+wrBWFf3YMoBh+b/892dd2V+OLKY4+Wr1TILzy8IXO8HMVYHRpkBRMzibQgr4OslgShPxQnZw4hWxSBBxM/8/wMNOq5+scl6zmOCP62j1KOazbH8LAkQ+0sF/pgJhxn6wQhLwY9/wEAAP//AwBQSwMEFAAGAAgAAAAhAMCvIFtNAgAAmgQAABEAAAB3b3JkL2RvY3VtZW50LnhtbJxUy47TMBTdI/EPVfZtkrZTOlHb0ZQybECqZuAD3MRJLPyIbKehrBASsACpsGBYj9gXWLBA4m9QS3f8AtdOUgpIo4pVcu/1Oee+7MHJY0YbcywVEXzo+C3PaWAeiojwZOg8fHDW7DsNpRGPEBUcD50FVs7J6OaNQRFEIswZ5roBFFwFc4imWmeB66owxQyplsgwh2AsJEMaTJm4DMlHe
dYMBcuQJjNCiV64bc/rORWNGDq55EFF0WQklEKJWBtIIOKYhLj61Ah5iG6JnFQpW0VXYgo5CK5Skqmajf0vG5SY1iTz64qYM1qfK7JD1CKJCpgHo2XahZBRJkWIlQLvpAzuGH3vOu2qgYZihzgkhT8160wYInxHY7bjr/nvhteC4bmltmuofhcCvRjBLs1EtDDfrFEEsIvR+dDxvG7nqH86dmrXBMcop9pEbo+94/GdOjI1rknHnxx5liybSsOlMhRCgwBOiUmu3YblLo3znIID6XsYKe24o4EL0iXKQmX1fya4VoBJCQddc/hUEVQBqkN6tP603L79vF1d/Vi9//nt9eby5ebqw/rF8/Xq62b5Zr283Hx5tf347vvTZ0ZHWzWQMZo2Txzqqayr/LfwC4jveW2bkosnACjgxvrH5u5AivDf63f6JjkgTu4jQ6kFbJjf7drKJUlSYPL7njVnQmsB616HKY73oilGEYardcuDF6AIYiH0npnk2ppeKRcKatpkOg59NRCbBbwQdyWJIGImMCU6hCw7PQuC6pUt3DainD/46kdl9AsAAP//AwBQSwMEFAAGAAgAAAAhAMccbRScBgAAURsAABUAAAB3b3JkL3RoZW1lL3RoZW1lMS54bWzsWU1vG0UYviPxH0Z7b2MndhpHdarYsRto00axW9TjeD3enXp2ZzUzTuobao9ISIiCeqAS4sIBAZVaCSTKr0kpKkXqX+Cdmd31TrwmSRtBBfUh8c4+7/fHvDO+eOlOxNA+EZLyuOlVz1c8RGKfD2kcNL0b/e65NQ9JheMhZjwmTW9KpHdp4/33LuJ1FZKIIKCP5TpueqFSyfrSkvRhGcvzPCExvBtxEWEFjyJYGgp8AHwjtrRcqawuRZjGHopxBGyvj0bUJ+jZz7+8+OaBt5Fx7zAQESupF3wmepo3cUgMdjiuaoScyjYTaB+zpgeChvygT+4oDzEsFbxoehXz8ZY2Li7h9ZSIqQW0Bbqu+aR0KcFwvGxkimCQC612a40LWzl/A2BqHtfpdNqdas7PALDvg6VWlyLPWnet2sp4FkD26zzvdqVeqbn4Av+VOZ0brVar3kh1sUwNyH6tzeHXKqu1zWUHb0AWX5/D11qb7faqgzcgi1+dw3cvNFZrLt6AQkbj8RxaB7TbTbnnkBFn26XwNYCvVVL4DAXZkGeXFjHisVqUaxG+zUUXABrIsKIxUtOEjLAPadzG0UBQrAXgdYILb+ySL+eWtCwkfUET1fQ+TDCUxIzfq6ffv3r6GB3efXJ496fDe/cO7/5oGTlU2zgOilQvv/3sz4cfoz8ef/3y/hfleFnE//bDJ89+/bwcCOUzU+f5l49+f/Lo+YNPX3x3vwS+KfCgCO/TiEh0jRygPR6BYcYrruZkIE5H0Q8xLVJsxoHEMdZSSvh3VOigr00xS6Pj6NEirgdvCmgfZcDLk9uOwr1QTBQtkXwljBzgDuesxUWpF65oWQU39ydxUC5cTIq4PYz3y2S3cezEtzNJoG9maekY3g6Jo+Yuw7HCAYmJQvodHxNSYt0tSh2/7lBfcMlHCt2iqIVpqUv6dOBk04xom0YQl2mZzRBvxzc7N1GLszKrt8i+i4SqwKxE+T5hjhsv44nCURnLPo5Y0eFXsQrLlOxNhV/EdaSCSAeEcdQZEinLaK4LsLcQ9CsYOlZp2HfYNHKRQtFxGc+rmPMicouP2yGOkjJsj8ZhEfuBHEOKYrTLVRl8h7sVop8hDjheGO6blDjhPr4b3KCBo9IsQfSbidCxhFbtdOCIxn/XjhmFfmxz4OzaMTTA5189LMmst7URb8KeVFYJ20fa7yLc0abb5mJI3/6eu4Un8S6BNJ/feN613Hct1/vPt9xF9XzSRjvrrdB29dxgh2IzIkcLJ+QRZaynpoxclWZIlrBPDLuwqOnM8ZDkJ6YkhK9pX3dwgcCGBgmuPqIq7IU4gQG76mkmgUxZBxIlXMLBziyX8tZ4GNKVPRbW9YHB9
gOJ1Q4f2uUVvZydC3I2ZrcJzOEzE7SiGZxU2MqFlCmY/TrCqlqpE0urGtVMq3Ok5SZDDOdNg8XcmzCAIBhbwMurcEDXouFgghkZar/bvTcLi4nCWYZIhnhI0hhpu+djVDVBynLF3ARA7pTESB/yjvFaQVpDs30DaScJUlFcbYG4LHpvEqUsg2dR0nV7pBxZXCxOFqODpteoL9c95OOk6Y3gTAtfowSiLvXMh1kAN0O+Ejbtjy1mU+WzaDYyw9wiqMI1hfX7nMFOH0iEVFtYhjY1zKs0BVisJVn9l+vg1rMywGb6a2ixsgbJ8K9pAX50Q0tGI+KrYrALK9p39jFtpXyiiOiFwwM0YBOxhyH8OlXBniGVcDVhOoJ+gHs07W3zym3OadEVb68Mzq5jloQ4bbe6RLNKtnBTx7kO5qmgHthWqrsx7vSmmJI/I1OKafw/M0XvJ3BTsDLUEfDhHldgpOu16XGhQg5dKAmp3xUwOJjeAdkCd7HwGpIKbpPNf0H29X9bc5aHKWs48Kk9GiBBYT9SoSBkF9qSyb5jmFXTvcuyZCkjk1EFdWVi1R6QfcL6ugeu6r3dQyGkuukmaRswuKP55z6nFTQI9JBTrDenh+R7r62Bf3ryscUMRrl92Aw0mf9zFUt2VUtvyLO9t2iIfjEbs2pZVYCwwlbQSMv+NVU45VZrO9acxcv1TDmI4rzFsJgPRAnc9yD9B/Y/KnxGTBrrDbXP96C3IvihQTODtIGsPmcHD6QbpF0cwOBkF20yaVbWtenopL2WbdZnPOnmco84W2t2knif0tn5cOaKc2rxLJ2detjxtV1b6GqI7NEShaVRdpAxgTG/aRV/deKD2xDoLbjfnzAlTTLBb0oCw+jZM3UAxW8lGtKNvwAAAP//AwBQSwMEFAAGAAgAAAAhAKTIKmgQAwAAAwcAABEAAAB3b3JkL3NldHRpbmdzLnhtbJxV21LbMBB970z/wePnhviShODBYUhCehmgnRo+QLaVWINuI8kx4eu7si0MNGWYPlk6Z/dod7Van188MurtsdJE8NQPTwLfw7wQJeG71L+/24zmvqcN4iWiguPUP2DtXyw+fzpvEo2NATPtgQTXiUj9WvFEFxVmSI8YKZTQYmtGhWCJ2G5JgfuP33uo1K+Mkcl43DudCIk5qG2FYsjoE6F2485zLYqaYW7GURDMxgpTZCBgXRGpnRr7XzU4qnIi+/eS2DPq7JoweM+yT7cRqnz2+Eh41kEqUWCtobKMdukyRLiT0fQjOl09r0mukDq8EFnAtT0JwbwmkVgVUFC48yDwx5bI4XBohLW4FSarlRI1L79hBNg/6Y0QpqdLvEU1NXcoz4yQcMAeQainUS9eVEihAowziQrIbSW4UYI6u9IeuhJMKki9CwfaRiJjA6s13lxdo4OoDUQybpKBgr4ttbWxi98QjRMMgnUcrqf96ZYdmDiKl5O4O+UNs45P1/NjzCSahfFRn0k8nV8uj/lMJ9Oz+OoYM1+enkazY8xqGZwtj/oM+UAF+rxZYpv3l1qcd6sNVNVjXelXiOWKIO/GtjfUjSW5elgS7vgcwzPDL5mszh05GnWEZojSDdycI6CzO6YkWq7xthWmN0jtBuW26CxRR1Hokx/ParYDsfoKrSY71UYh+Z2XALsDw8mk1yPcXBPmcF3nmfPi0OUvKOjbn3tlBcdDgZrEwGDCtkLXiO9cN2A+us982GGkzaUmKPWfqtHq1npDo1GV2XmGb5CU0LVgl+/C1KdkV5nQuhnYlUg9tJt8F/Vc1HKws1y7QYVNFqz7hTXolmDVLwYsdlg8YBOHTQZs6rDpgM0cNrNYdYCXTgl/gLnhlhbfCkpFg8tvDkz9v6CuCLpCEsNV2+cNb00kLdC/d+3tE/wIYwSXxMCvQpKSocfUn4ZRZN17a9q+3le2lrPG8hXqlcjAHYTt7b1ybl/+m1iapMQFgR7NDiwfpslJFzgl2mRYwuAxQkHK7UT60ioPf6/FHwAAA
P//AwBQSwMEFAAGAAgAAAAhAN+AEEUqAgAAQgcAABIAAAB3b3JkL2ZvbnRUYWJsZS54bWy0lNFu0zAUhu+ReIfI9yyO22ahWjpthVzugo0HcFOnsRTbke027AG44gpxyTuMB0DjaZjE3oLjOB3QriKRNkeylHPi46Mv/39OTj+IKtgwbbiSKYqOMAqYzNWSy1WK3l9lrxIUGEvlklZKshRdM4NOZy9fnDTTQklrAjgvzVSnqLS2noahyUsmqDlSNZOQK5QW1MKrXoWqKHjO3qh8LZi0IcE4DjWrqIW7Tclrg7pqTZ9qjdLLWqucGQPNisrXE5RLNOu6C5qppAK6ntOKLzRvEzWVyrAIchtapQgTnOEJ7O4Z45HbUegq5CXVhtmHD7EPF1Tw6nobNQ03xidqbvNyG99QzemiYj5l+AoSa7PAKTrDsMjbDPlIlKKxC+Dj8y5CoKludZHRv5G8reM/ed3WgQjUeTgF7Yf+/+yRuPtx8+v22/3Xj/ffPx/AcQ44HAaHo90fxZHET4MjcV2T5PgPDpLgbDSfkF0cUfwfHMAxGojjigtmggvWBO+UoF45+wIhOAYiE+DhhDIaJBDd1m0F1VMg4AtMzv4m0jKC6C4RDOZsZXVYIFE2lAgtoeOD0ohbjzgUThzP65Q4ejIQEShjIIi7m08/b7+0IGhlL2CObL19ycXlWnam35smEYgFg0jgSv8MsA9dW9XV7TdM4Ca3Rp0OYAiQJMlcaFcrfdyDI3eq/zCZUwFT9ZBYnFu8a5x7holluGseH6sYj59nrHbz1cx+AwAA//8DAFBLAwQUAAYACAAAACEAStiKkrsAAAAEAQAAFAAAAHdvcmQvd2ViU2V0dGluZ3MueG1sjM7BasMwDMbxe2HvEHRfnfUwSkhSKKMv0PUBXEdpDLFkJG3e9vQ1bJfdehSf+PHvD19pbT5RNDIN8LJtoUEKPEW6DXB5Pz3voVHzNPmVCQf4RoXD+LTpS1fwekaz+qlNVUg7GWAxy51zGhZMXreckeo2syRv9ZSb43mOAd84fCQkc7u2fXWCq7daoEvMCn9aeUQrLFMWDqhaQ9L66yUfCcbayNliij94YjkKF0VxY+/+tY93AAAA//8DAFBLAwQUAAYACAAAACEAfLG9I3MBAADHAgAAEAAIAWRvY1Byb3BzL2FwcC54bWwgogQBKKAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACcUstOwzAQvCPxD1HurdOCEEJbV6gV4sBLaqBny94kFo5t2W7V/j2bhoYgbvi0M2uPZ8eG5aE12R5D1M4u8tm0yDO00ilt60X+Xj5MbvMsJmGVMM7iIj9izJf88gLegvMYksaYkYSNi7xJyd8xFmWDrYhTalvqVC60IhEMNXNVpSWundy1aBObF8UNw0NCq1BN/CCY94p3+/RfUeVk5y9+lEdPhjmU2HojEvKXzo6ZKpdaYAMLpUvClLpFPiN6APAmaowd1xewdUFFPgfWF7BqRBAyUX58dgVsBOHee6OlSBQsf9YyuOiqlL2eIsi648DGW4Bi2aDcBZ2OvAA2hvCkbW+jL8hWEHUQvvn2NiDYSGFwRbPzSpiIwH4IWLnWC3vk5PNckd5nfPelW3fhfB/5TY5m3OrUbLyQnZfr8bSjBmwoElRk/yz3Q8AjPUcw3Z2UlK1Rnff8bXT5f
fT/ks/m04LWKbAzR08yfBj+BQAA//8DAFBLAwQUAAYACAAAACEATpDnu3cBAADLAgAAEQAIAWRvY1Byb3BzL2NvcmUueG1sIKIEASigAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAfJJNT8MwDIbvSPyHKvc26So+Fm1FfIgTSEgMgbiFxNsCTRol3sr+PWm7FYoQt9h+/ch+ndnFp6mSLfigazsnecZIAlbWStvVnDwtbtNzkgQUVomqtjAnOwjkojw+mknHZe3hwdcOPGoISSTZwKWbkzWi45QGuQYjQhYVNhaXtTcCY+hX1An5IVZAJ4ydUgMolEBBW2DqBiLZI5UckG7jqw6gJIUKDFgMNM9y+q1F8Cb82dBVfiiNxp2LO+3H/clWsi8O6s+gB2HTNFlTdGPE+XP6cn/32K2aatt6JYGUMyU5aqygnNHvZ3yFzds7SOzTQxDd/IBdU3sVYmUUxR4FQXrtMN6o7xsloroSAe/j0ZYa1NWuvFRGWx3QC6x9h/slaK/nYavbq5eTTjGEcfLOKC49CASVxNV5b9Sh8lxc3yxuSRm3P0/ZNM3zRX7GJ4wz9tquO+pvregTZj/hv8R82hHZghX8pBgTD4DeovH3K78AAAD//wMAUEsDBBQABgAIAAAAIQD/x/uldQcAAIA6AAAPAAAAd29yZC9zdHlsZXMueG1stJvNcts2EMfvnek7cHh39eXIjSdKxnGSxjNO6ljy9AyRkImGIlQSiu3c+wA99x3aB+i0b9OZ5i0KLCiYIkVxV2ROMT+wPyx28V/awT57cb+MvU88zYRMJv7gu77v8SSQoUhuJ/7N7M3R976XKZaELJYJn/gPPPNfPP/2m2d3p5l6iHnmaQNJdppO/Eip1WmvlwURX7LsO7niiX62kOmSKX2Z3vbkYiEC/koG6yVPVG/Y7497KY+Z0vAsEqvMz63dYazdyTRcpTLgWaZnu4ytvSUTif9cTy+UwSu+YOtYZeYyvUrzy/wK/nkjE5V5d6csC4SY6YlrF5cikenbsyQTvn7CWabOMsEm/r///PHf339++f3XL3/9Zp5E5pWdY4JMFUy9FKHwewaXfdbDPrF44g+HmzvnBr91L2bJ7eYeT45uptvT+Bwdnb83t+ba7sRn6dH0zBjrgY+bfwu+rrY811cwlRUL9KppM2yhuI6eDoYxGgsT5eHQXVyvY32DqUuzEDkHbGhe0bK+LK24jquO8tRmiX7KF5cy+MjDqdIPJj4Q9M2bi6tUyFSoh4n/9KmZg7455UvxVoQhN0mZ37tJIhHynyKe3GQ8fLz/4Q2kWG4xkOtEaQ/GJ5AFcRa+vg/4yqSYNp0wE+H3ZkBszGYFDkxoLR5nY2+UqHDzlw1yYMO4kxJxZraRB/PfCwKv161BQ+NR0QGwS5rrqL2J4/YmnrQ3MW5v4qS9CS2ebSNic6OQlfigKhnY5CvmxOjpnpQ1IypZ1DiikjSNIyo50jiikhKNIyoZ0DiiEvDGEZX4No6ohHPviICBcJWzaASrgdrYM6FibsbvFaBBS6nLq413xVJ2m7JV5JnCWp72PrGcrucKN1WQ08PFcqpSmdw2rogu0GbrHqzJr5eriGVCf9E0LL2ttIeDZmwec++HVISNqCc2+So+wbfJzhJ2FbOARzIOeerN+L2NKGH8e+lN7YdG4+RahvVS3EbKm
0ZQchth45p8r18Ja/9S6G+gpoiOa1xpMo6K4bgmL+uNv+OhWC83S4P4GhlbPSeEuYSAKe7Vm/GxWcVq0jd6YQKAccGWC7oLYB8xf1tc6PZNjDHzt6XoQPuI+dvCdaB9yI/98SUrzSuWfvRQ2+uEvHfPZSzTxTre7IHGHXxC3sEOgXOBvImdfZRInJB38JZ8emdBoH9zw+QpORaPOkqgkMNhKbDZ8L6Qg1KSvQHBI3KASqwhgdVOawkgsuhe80/C/OGJWgxApd23ZuN2HtWsgC5BqG/oD2upmr+hhzWah6VcJPrPJRn3cLRRzc7D0vJ8svWOEON2hY8AalcBCaB2pZAAqsmP+m8eVxPxkPbFkcAiy7KrYpB2aGU+ISuzA9FKQEd1E/H9VbN763OhWjcRFHKAqnUTQSFHp1TLXN1EsDqrmwhWTdWoj1FRUylOketmEeS+BBAedSPeCFA34o0AdSPeCFB78W6GdCfeCBZZG5ymFsUbAYJXKL/qO1BRvBEgsjZYtcv/ZrSpe2Bl/y+3HYg3gkIOUFW8ERRydOrEG8GCVyiZUGI5qUOwuhFvBKgb8UaAuhFvBKgb8UaAuhFvBKi9eDdDuhNvBIusDU5Ti+KNAJHlwYGK4o0AwSsUbdgp3rDrv7p4IyjkAFXFG0EhR6ckqO4jFcEiB6jEcuKNYMErlGTIWZDcFKe6EW+ER92INwLUjXgjQN2INwLUXrybId2JN4JF1ganqUXxRoDI8uBARfFGgMjasFO8YTN+dfFGUMgBqoo3gkKOTklQnc4hWOQAlVhOvBEsyJfW4o0AwSuHgigedSPeCI+6EW8EqBvxRoDai3czpDvxRrDI2uA0tSjeCBBZHhyoKN4IEFkbdoo37JGvLt4ICjlAVfFGUMjRKQmqE28EixygEstJHYLVjXgjQJCYrcUbAYJXDgDBLqKEqRvxRnjUjXgjQO3FuxnSnXgjWGRtcJpaFG8EiCwPDlQUbwSIrA3mnK0+L4o+njqoSQLsOYPNqQY0cFgTJCwwd/CaL3iqO5l48+mQlsCNhwRiTXpgXXwp5UcPd7B7VJMgaJSYx0LCke4HOKVTaEQYnezpJJj9eO69tQ0wlXGQUtsnb3T3ULFdCDqUTOOQnqd6WOmWndXmZLmxphuETF9X3gIEfWgXuiGIQcePafHR70BLVd7oA/9lmwPhZ93uFm7e6fdHw9HL45F1Jm+PYuHP60xdm0PFF8njq/adLGGrmYSdmnP6+QPXVJW3UR3DfxWZi7yNaq2kedX2TkEHmm3HSrca0Sb+jEVyyYyv0GXmbtixupcNbIDn1bUKIr1Yge7q2rdW/cpi1RzihwV77CDZLFt+mP/xk8++t3Wk1M62ZpbKHFzfN8NBZYY2nB4cebcrXp2XbiGDmTRNzJ39grfVPLaB0D/YgOsWRAieTb/wnlmz+vk5j+N3TEdA56Zc6fWoeTXmC2WfDvpQs0um5lIpuawfn8KRdjC/y4DOoeJk7OX+xEjWyzlPdU/avmUf7lh2ezLXRtgpwCYTsCuuZ5inwuan7Pn/AAAA//8DAFBLAQItABQABgAIAAAAIQDd/JU3ZgEAACAFAAATAAAAAAAAAAAAAAAAAAAAAABbQ29udGVudF9UeXBlc10ueG1sUEsBAi0AFAAGAAgAAAAhAB6RGrfzAAAATgIAAAsAAAAAAAAAAAAAAAAAnwMAAF9yZWxzLy5yZWxzUEsBAi0AFAAGAAgAAAAhANZks1H6AAAAMQMAABwAAAAAAAAAAAAAAAAAwwYAAHdvcmQvX3JlbHMvZG9jdW1lbnQueG1sLnJlbHNQSwECLQAUAAYACAAAACEAwK8gW00CAACaBAAAEQAAAAAAAAAAAAAAAAD/CAAAd29yZC9kb2N1bWVudC54bWxQSwECLQAUAAYACAAAACEAxxxtFJwGAABRGwAAFQAAAAAAAAAAAAAAAAB7CwAAd29yZC90aGVtZS90aGVtZTEueG1sUEsBAi0AFAAGA
AgAAAAhAKTIKmgQAwAAAwcAABEAAAAAAAAAAAAAAAAAShIAAHdvcmQvc2V0dGluZ3MueG1sUEsBAi0AFAAGAAgAAAAhAN+AEEUqAgAAQgcAABIAAAAAAAAAAAAAAAAAiRUAAHdvcmQvZm9udFRhYmxlLnhtbFBLAQItABQABgAIAAAAIQBK2IqSuwAAAAQBAAAUAAAAAAAAAAAAAAAAAOMXAAB3b3JkL3dlYlNldHRpbmdzLnhtbFBLAQItABQABgAIAAAAIQB8sb0jcwEAAMcCAAAQAAAAAAAAAAAAAAAAANAYAABkb2NQcm9wcy9hcHAueG1sUEsBAi0AFAAGAAgAAAAhAE6Q57t3AQAAywIAABEAAAAAAAAAAAAAAAAAeRsAAGRvY1Byb3BzL2NvcmUueG1sUEsBAi0AFAAGAAgAAAAhAP/H+6V1BwAAgDoAAA8AAAAAAAAAAAAAAAAAJx4AAHdvcmQvc3R5bGVzLnhtbFBLBQYAAAAACwALAMECAADJJQAAAAA=";
        String docName="test.docx";
        // Values that change per request
        params.put("base64Doc",base64Doc);
        params.put("docName", docName);
        
        String result=requestContent(requestUrl,params,method);
        System.out.println(result);
    }
}

Python version:

# -*- coding: utf-8 -*-
# flake8: noqa
__author__ = 'wukong'

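from urllib.parse import urlencode
from urllib.request import urlopen

# Set the appKey and openId you applied for
app_key="***"
open_id="***"


def request_content(request_url,params,method):
    """Send the request and print the response body.

    request_url: request URL
    params: request parameters (dict)
    method: request method ("get" or "post")
    """
    query = urlencode(params)

    if method and method.lower() =="get":
        f = urlopen("%s?%s" % (request_url, query))
    else:
        # In Python 3 the POST body must be bytes
        f = urlopen(request_url, query.encode("utf-8"))

    content = f.read()
    # Decode as UTF-8, the encoding the Java demo uses to read the response
    print(content.decode("utf-8"))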

   
def main():
    
    domain="http://api.xiaocongjisuan.com/"
    servlet="develop/doccontent/analysis"
    method="post"
    request_url=domain+servlet
    
    # Request parameters
    params ={}
    
    params["appKey"]=app_key
    params["openId"]=open_id
    
    # Values that change per request
    base64Doc=""
    params["base64Doc"]=base64Doc
    params["docName"]="test.docx"
    
    request_content(request_url,params,method)
    
if __name__ == '__main__':
    main()

PHP version:

<?php

/**
 * @author 
 * @copyright 2019
 */
 
header("content-type:text/html;charset=utf-8");         // set the response encoding
 
// Set the appKey and openId you applied for
$app_key = "***";
$open_id = "***";

/**
$url request URL
$params request parameters (query string)
$ispost true to send a POST request, false for GET
*/

function http_curl($url,$params=false,$ispost=false){
   
    $httpInfo = array();
    $ch = curl_init();

    curl_setopt( $ch, CURLOPT_HTTP_VERSION , CURL_HTTP_VERSION_1_1 );
    curl_setopt( $ch, CURLOPT_USERAGENT , "xiaocongjisuan");
    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT , 60 );
    curl_setopt( $ch, CURLOPT_TIMEOUT , 60);
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER , true );
    
    if( $ispost )
    {
        curl_setopt( $ch , CURLOPT_POST , true );
        curl_setopt( $ch , CURLOPT_POSTFIELDS , $params );
        curl_setopt( $ch , CURLOPT_URL , $url );
    }
    else
    {
        if($params){
            curl_setopt( $ch , CURLOPT_URL , $url.'?'.$params );
        }else{
            curl_setopt( $ch , CURLOPT_URL , $url);
        }
    }
    
    $response = curl_exec( $ch );
    if ($response === FALSE) {
        //echo "cURL Error: " . curl_error($ch);
        curl_close( $ch );
        return false;
    }
    $httpCode = curl_getinfo( $ch , CURLINFO_HTTP_CODE );
    $httpInfo = array_merge( $httpInfo , curl_getinfo( $ch ) );
    curl_close( $ch );
    
    return $response;
}

function main(){
    
    global $app_key;
    global $open_id;
    
    $domain="http://api.xiaocongjisuan.com/";
    $servlet="develop/doccontent/analysis";
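    $method="post";
    
    $url=$domain."".$servlet;
    
    $params['appKey']=$app_key;
    $params['openId']=$open_id;
    
    // Values that change per request
    $params["base64Doc"]="";
    $params["docName"]="test.docx";

    // Convert values from GBK to UTF-8 (assumes GBK input; drop this loop if your values are already UTF-8)
    foreach ($params as $key=>$value) {
        $params[$key]=mb_convert_encoding($value, "UTF-8", "GBK");
    }

    $paramstring = http_build_query($params);
    // The interface requires POST, so $ispost follows $method
    $content = http_curl($url,$paramstring,strtolower($method)=="post");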
    
    return $content;
}

echo main();
?>

C# version:

using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;

namespace ConsoleApplication1
{
    class Program
    {
        private static string appKey="yours";
        private static string openId = "yours";
       
        static string getResponseAsString(HttpWebResponse rsp, Encoding encoding)
        {
            System.IO.Stream stream = null;
            StreamReader reader = null;
            try
            {
                // Read the HTTP response as a character stream
                stream = rsp.GetResponseStream();
                reader = new StreamReader(stream, encoding);
                return reader.ReadToEnd();
            }
            finally
            {
                // Release resources
                if (reader != null) reader.Close();
                if (stream != null) stream.Close();
                if (rsp != null) rsp.Close();
            }
        }

        /*
         * parameters request parameters
         * encode encoding used to URL-encode values ("utf8" or "gb2312")
         */

        static string buildQuery(IDictionary<string,object> parameters, string encode)
        {
            StringBuilder postData = new StringBuilder();
            bool hasParam = false;
            IEnumerator<KeyValuePair<string, object>> dem = parameters.GetEnumerator();
            while (dem.MoveNext())
            {
                string name = dem.Current.Key;
                string value = dem.Current.Value.ToString();
                // Skip parameters whose name is empty
                if (!string.IsNullOrEmpty(name))//&& !string.IsNullOrEmpty(value)
                {
                    if (hasParam)
                    {
                        postData.Append("&");
                    }
                    postData.Append(name);
                    postData.Append("=");
                    if (encode == "gb2312")
                    {
                        postData.Append(System.Web.HttpUtility.UrlEncode(value, Encoding.GetEncoding("gb2312")));
                    }
                    else if (encode == "utf8")
                    {
                        postData.Append(System.Web.HttpUtility.UrlEncode(value, Encoding.UTF8));
                    }
                    else
                    {
                        postData.Append(value);
                    }
                    hasParam = true;
                }
            }
            return postData.ToString();
        }


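        /**
        * Sends an HTTP request and returns the response body.
        *
        * @param url request URL
        * @param parameters request parameters
        * @param method request method ("get" or "post")
        * @return response body (or the exception message if a POST fails)
        * @throws Exception
        */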
        static string requestContent(string url, IDictionary<string,object> parameters, string method)
        {
            if (method.ToLower() == "post")
            {
                HttpWebRequest req = null;
                HttpWebResponse rsp = null;
                System.IO.Stream reqStream = null;
                try
                {
                    req = (HttpWebRequest)WebRequest.Create(url);
                    req.Method = "POST";
                    req.KeepAlive = false;
                    req.ProtocolVersion = HttpVersion.Version10;
                    req.Timeout = 5000;
                    req.ContentType = "application/x-www-form-urlencoded;charset=utf-8";
                    byte[] postData = Encoding.UTF8.GetBytes(buildQuery(parameters, "utf8"));
                    reqStream = req.GetRequestStream();
                    reqStream.Write(postData, 0, postData.Length);
                    rsp = (HttpWebResponse)req.GetResponse();
                    // Fall back to UTF-8 when the response does not declare a charset
                    Encoding encoding = string.IsNullOrEmpty(rsp.CharacterSet) ? Encoding.UTF8 : Encoding.GetEncoding(rsp.CharacterSet);
                    return getResponseAsString(rsp, encoding);
                }
                catch (Exception ex)
                {
                    return ex.Message;
                }
                finally
                {
                    if (reqStream != null) reqStream.Close();
                    if (rsp != null) rsp.Close();
                }
            }
            else
            {
                // Create the request
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url + "?" + buildQuery(parameters, "utf8"));

                // GET request
                request.Method = "GET";
                request.ReadWriteTimeout = 5000;
                request.ContentType = "text/html;charset=UTF-8";
                HttpWebResponse response = (HttpWebResponse)request.GetResponse();
                Stream myResponseStream = response.GetResponseStream();
                StreamReader myStreamReader = new StreamReader(myResponseStream, Encoding.GetEncoding("utf-8"));

                // Return the response body
                string retString = myStreamReader.ReadToEnd();
                return retString;
            }
        }

        static void Main(string[] args)
        {

            String domain = "http://api.xiaocongjisuan.com/";
            String servlet = "develop/doccontent/analysis";
            String method = "post";
            String url = domain + servlet;


            var parameters = new Dictionary<string,object>();

            parameters.Add("appKey", appKey);
            parameters.Add("openId", openId);

            // Values that change per request
            String base64Doc = "";
            String docName = "test.docx";
            parameters.Add("base64Doc", base64Doc);
            parameters.Add("docName", docName);

            string result = requestContent(url, parameters, method);
            Console.WriteLine(result);
            Console.Read();

        }

    }
}

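And so on; for other languages, see the interface documentation.

Because the interface takes Base64 values, and converting a file to Base64 is simple in every language, I won't go into detail here. That's all.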
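For completeness, a minimal Python 3 sketch of that conversion for a file on disk. The function name `document_fields` is made up for this example, and using the file's own name (with its extension) as `docName` is an assumption based on the demos' `test.docx` value.

```python
import base64
from pathlib import Path


def document_fields(path):
    """Return the docName and base64Doc form fields for a file on disk."""
    file_path = Path(path)
    # Read the whole file as bytes, then Base64-encode it to ASCII text.
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    # Assumption: docName is the file name including its extension.
    return {"docName": file_path.name, "base64Doc": encoded}
```

The returned dictionary can be merged with the `appKey` and `openId` entries before the parameters are URL-encoded and posted.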

Original article

https://segmentfault.com/a/1190000020612263
