Reposted

Programming Fun: Converting Chinese Characters to Pinyin

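Languages differ a lot from country to country, and English has been especially dominant this century. For programmers at least, programming feels like a foreign import: the code is written in English. That makes Chinese characters a special case. Below is how Chinese characters get converted to pinyin.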

Preparing the jar

Click here to download pinyin4j.jar

If you use Maven, it is even easier:

<!-- jar for converting Chinese characters to pinyin -->
        <dependency>
            <groupId>com.belerweb</groupId>
            <artifactId>pinyin4j</artifactId>
            <version>2.5.0</version>
        </dependency>

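Usage

With the jar in place we can start using it. The library is already well encapsulated, so a few simple calls are enough.
Step 1: define the output format for the pinyin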

HanyuPinyinOutputFormat hypy = new HanyuPinyinOutputFormat();

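The hypy object defined above specifies the pinyin format. As Chinese speakers know, pinyin is letters spelled together plus a tone, so the format controls how the letters and the tone are displayed.
Looking at the source of the HanyuPinyinOutputFormat class, we can see that the output format has three properties, and all three already have default values: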

  /**
   * Restore default variable values for this class
   * 
   * Default values are listed below:
   * 
   * <p>
   * HanyuPinyinVCharType := WITH_U_AND_COLON <br>
   * HanyuPinyinCaseType := LOWERCASE <br>
   * HanyuPinyinToneType := WITH_TONE_NUMBER <br>
   */
  public void restoreDefault() {
    vCharType = HanyuPinyinVCharType.WITH_U_AND_COLON;
    caseType = HanyuPinyinCaseType.LOWERCASE;
    toneType = HanyuPinyinToneType.WITH_TONE_NUMBER;
  }

In other words, if we set nothing, the three properties take the defaults shown above. So which pinyin output does each format setting produce? Let's keep going.

LOWERCASE
Combination	WITH_U_AND_COLON	WITH_V	WITH_U_UNICODE
WITH_TONE_NUMBER	lu:3	lv3	lü3
WITHOUT_TONE	lu:	lv	lü
WITH_TONE_MARK	throws exception	throws exception	lǚ

UPPERCASE
Combination	WITH_U_AND_COLON	WITH_V	WITH_U_UNICODE
WITH_TONE_NUMBER	LU:3	LV3	LÜ3
WITHOUT_TONE	LU:	LV	LÜ
WITH_TONE_MARK	throws exception	throws exception	LǙ

The two tables above show the pinyin produced by each combination of the three format properties on hypy. Let me explain them anyway.

* @see HanyuPinyinVCharType
 * @see HanyuPinyinCaseType
 * @see HanyuPinyinToneType
 
hypy.setCaseType(HanyuPinyinCaseType.LOWERCASE);  
hypy.setToneType(HanyuPinyinToneType.WITH_TONE_NUMBER);  
hypy.setVCharType(HanyuPinyinVCharType.WITH_V);

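First, setCaseType sets whether the output pinyin is lowercase or uppercase; that needs no further explanation.

setToneType sets how the tone is shown. There are three choices:

- HanyuPinyinToneType.WITH_TONE_NUMBER: marks the tone with a digit; zhang1, zhang2, zhang3, zhang4 are the first, second, third, and fourth tones.
- HanyuPinyinToneType.WITHOUT_TONE: no tone at all.
- HanyuPinyinToneType.WITH_TONE_MARK: marks the tone with a diacritic, the way we normally write pinyin (for example lǚ). As the tables above show, this only works together with WITH_U_UNICODE.

setVCharType controls how ü is written: WITH_U_AND_COLON gives u:, WITH_V gives v, and WITH_U_UNICODE gives ü.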
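
To make the two tables concrete, here is a minimal sketch of how such combinations could be rendered. It is not pinyin4j's code: the class PinyinFormatSketch, its enums, and the render method are hypothetical names made up for illustration, and the tone-mark placement follows the usual rule (a or e first, then the o of ou, otherwise the last vowel). Non-ASCII letters are written as Unicode escapes so the file compiles whatever the source encoding is.

```java
import java.util.Locale;

// Hypothetical sketch: these types only mirror the option names in the tables,
// they are NOT the pinyin4j classes.
class PinyinFormatSketch {
    enum VCharType { WITH_U_AND_COLON, WITH_V, WITH_U_UNICODE }
    enum CaseType { LOWERCASE, UPPERCASE }
    enum ToneType { WITH_TONE_NUMBER, WITHOUT_TONE, WITH_TONE_MARK }

    // One row per vowel: plain vowel at index 0, tones 1..4 at indexes 1..4.
    private static final String[] MARKED = {
        "a\u0101\u00e1\u01ce\u00e0",
        "e\u0113\u00e9\u011b\u00e8",
        "i\u012b\u00ed\u01d0\u00ec",
        "o\u014d\u00f3\u01d2\u00f2",
        "u\u016b\u00fa\u01d4\u00f9",
        "\u00fc\u01d6\u01d8\u01da\u01dc"   // u with diaeresis
    };

    /** Renders a raw syllable such as "lu:3" ("u:" for u-diaeresis, trailing tone digit). */
    static String render(String raw, VCharType v, CaseType c, ToneType t) {
        if (t == ToneType.WITH_TONE_MARK && v != VCharType.WITH_U_UNICODE) {
            // Mirrors the "throws exception" cells of the tables.
            throw new IllegalArgumentException("tone marks require WITH_U_UNICODE");
        }
        // Split off the trailing tone digit, if there is one.
        int tone = 0;
        String letters = raw;
        char last = raw.charAt(raw.length() - 1);
        if (Character.isDigit(last)) {
            tone = last - '0';
            letters = raw.substring(0, raw.length() - 1);
        }
        // Choose the spelling of u-diaeresis.
        if (v == VCharType.WITH_V) {
            letters = letters.replace("u:", "v");
        } else if (v == VCharType.WITH_U_UNICODE) {
            letters = letters.replace("u:", "\u00fc");
        }
        String out;
        if (t == ToneType.WITH_TONE_NUMBER) {
            out = tone > 0 ? letters + tone : letters;
        } else if (t == ToneType.WITHOUT_TONE) {
            out = letters;
        } else {
            out = addToneMark(letters, tone);
        }
        return c == CaseType.UPPERCASE ? out.toUpperCase(Locale.ROOT) : out;
    }

    // Puts the diacritic on a or e if present, else on the o of "ou", else on the last vowel.
    private static String addToneMark(String letters, int tone) {
        if (tone < 1 || tone > 4) {
            return letters; // neutral tone or no tone digit: nothing to mark
        }
        int pos = letters.indexOf('a');
        if (pos < 0) pos = letters.indexOf('e');
        if (pos < 0) pos = letters.indexOf("ou");
        for (int i = letters.length() - 1; pos < 0 && i >= 0; i--) {
            if ("iou\u00fc".indexOf(letters.charAt(i)) >= 0) pos = i;
        }
        if (pos < 0) {
            return letters;
        }
        char vowel = letters.charAt(pos);
        for (String row : MARKED) {
            if (row.charAt(0) == vowel) {
                return letters.substring(0, pos) + row.charAt(tone) + letters.substring(pos + 1);
            }
        }
        return letters;
    }

    public static void main(String[] args) {
        System.out.println(render("lu:3", VCharType.WITH_V, CaseType.LOWERCASE, ToneType.WITH_TONE_NUMBER));
        System.out.println(render("lu:3", VCharType.WITH_U_AND_COLON, CaseType.UPPERCASE, ToneType.WITHOUT_TONE));
        System.out.println(render("lu:3", VCharType.WITH_U_UNICODE, CaseType.LOWERCASE, ToneType.WITH_TONE_MARK));
    }
}
```

The first call reproduces the lv3 cell of the lowercase table. Asking for WITH_TONE_MARK together with WITH_V or WITH_U_AND_COLON fails fast, which is the same design choice the tables describe for the library.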

Output

With the format set up in the previous step, we can now produce the output.

// charAt(2) is 华; [0] takes the first reading in the returned array
String pinyin = PinyinHelper.toHanyuPinyinStringArray("张新华".charAt(2), hypy)[0];

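And that's it: the call returns the pinyin of our character. For a user of the library the story ends here, but out of curiosity let's look a little further.

The static method toHanyuPinyinStringArray in PinyinHelper: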

static public String[] toHanyuPinyinStringArray(char ch, HanyuPinyinOutputFormat outputFormat)
      throws BadHanyuPinyinOutputFormatCombination {
    return getFormattedHanyuPinyinStringArray(ch, outputFormat);
  }

In other words, it returns the formatted pinyin of ch.

Inside getFormattedHanyuPinyinStringArray, the unformatted pinyin is fetched first and then formatted.

String[] pinyinStrArray = getUnformattedHanyuPinyinStringArray(ch);

How the unformatted pinyin is obtained is the key point, so we focus on that part.

private static String[] getUnformattedHanyuPinyinStringArray(char ch) {
    return ChineseToPinyinResource.getInstance().getHanyuPinyinStringArray(ch);
  }

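In the code above we can see the ChineseToPinyinResource class. Inside it we can find where its data is initialized; think of this data as the equivalent of the database in a web project.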

/**
   * Initialize a hash-table contains <Unicode, HanyuPinyin> pairs
   */
  private void initializeResource() {
    try {
      final String resourceName = "/pinyindb/unicode_to_hanyu_pinyin.txt";

      setUnicodeToHanyuPinyinTable(new Properties());
      getUnicodeToHanyuPinyinTable().load(ResourceHelper.getResourceInputStream(resourceName));

    } catch (FileNotFoundException ex) {
      ex.printStackTrace();
    } catch (IOException ex) {
      ex.printStackTrace();
    }
  }

The code above shows that this so-called database is unicode_to_hanyu_pinyin.txt.

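So what is in it? Open the file and you will find that it is simply a mapping between Unicode code points and pinyin. When we ask for a pinyin, getHanyuPinyinRecordFromChar(ch) first takes the character's Unicode code point and then uses it to look up the matching entry in unicode_to_hanyu_pinyin.txt. The other files handle other romanization systems; with limited time and ability, I will not dig into them for now.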
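
The lookup just described can be sketched with nothing but java.util.Properties. This is only an illustration under stated assumptions: PinyinTableSketch and lookup are hypothetical names, the three sample entries are made up for the demo rather than copied from the real file, and the assumed line layout is a hex code point followed by the readings in parentheses.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Locale;
import java.util.Properties;

// Hypothetical sketch of a <Unicode, HanyuPinyin> table; not the pinyin4j implementation.
class PinyinTableSketch {
    // Illustrative sample data. Assumed layout: "<hex code point> (<reading>,<reading>,...)".
    private static final String SAMPLE =
        "5F20 (zhang1)\n" +
        "65B0 (xin1)\n" +
        "534E (hua2,hua4,hua1)\n";

    private final Properties table = new Properties();

    PinyinTableSketch() {
        try {
            // Properties splits each line at the first whitespace: key "5F20", value "(zhang1)".
            table.load(new StringReader(SAMPLE));
        } catch (IOException ex) {
            throw new IllegalStateException("could not load sample table", ex);
        }
    }

    /** Returns the unformatted readings of ch, or an empty array if the table has no entry. */
    String[] lookup(char ch) {
        // The key is the character's code point as an uppercase hex string.
        String key = Integer.toHexString(ch).toUpperCase(Locale.ROOT);
        String record = table.getProperty(key);
        if (record == null || !record.startsWith("(") || !record.endsWith(")")) {
            return new String[0];
        }
        // Strip the parentheses and split the comma-separated readings.
        return record.substring(1, record.length() - 1).split(",");
    }

    public static void main(String[] args) {
        PinyinTableSketch sketch = new PinyinTableSketch();
        // U+5F20, U+65B0, U+534E: the three characters of the name used earlier.
        for (char ch : "\u5f20\u65b0\u534e".toCharArray()) {
            System.out.println(Integer.toHexString(ch) + " -> " + Arrays.toString(sketch.lookup(ch)));
        }
    }
}
```

Returning an empty array for unknown characters is a choice made for this sketch so that callers can skip a null check.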

Once the unformatted pinyin has been obtained, what remains is applying the format.