pinyin4j获取多音字首字母同时保留非中文字符

2024-05-14 14:18

本文主要是介绍pinyin4j获取多音字首字母同时保留非中文字符,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

pinyin4j获取多音字首字母同时保留非中文字符

  • 前情:获取中文的首字母,要求正确识别多音字(例:重庆,重启,重量,成长等),同时需要保留非中文字符
    • 要求项目中导入com.belerweb.pinyin4j.2.5.1包,然后将下面的类放入项目中即可使用
      • ==以下内容暂时还未经过大量数据测试,后续若发现问题会及时修改==

前情:获取中文的首字母,要求正确识别多音字(例:重庆,重启,重量,成长等),同时需要保留非中文字符

当前pinyin4j的最新版2.5.1里面不支持多音字的正确获取首字母(网上找的解决方案大多数也是当遇到多音字时只取第一个拼音),于是扩展了下它的部分源码,支持多音字的首字母获取。

要求项目中导入com.belerweb.pinyin4j.2.5.1包,然后将下面的类放入项目中即可使用

以下内容暂时还未经过大量数据测试,后续若发现问题会及时修改

以下表格为修改记录

修改时间修改内容
2019-05-28发布
2020-04-23修改部分获取首字母异常,加了py.length() > 0判断
2022-07-05支持classpath下自定义拼音扩展库

如下是重新定义的**PinyinHelper.toHanYuPinyinString()**方法,命名、使用方式与源码一致,使用时需注意正确地导入类名

multi_pinyin.txt是多音字库(pinyin4j源码包里有),可以自己改个名字以及存储路径来扩展里面的多音字,里面并不是全的,比如“重启”需要添加“重启 (chong2,qi3)”才能正确识别

import net.sourceforge.pinyin4j.format.HanyuPinyinCaseType;
import net.sourceforge.pinyin4j.format.HanyuPinyinOutputFormat;
import net.sourceforge.pinyin4j.format.HanyuPinyinToneType;
import net.sourceforge.pinyin4j.format.HanyuPinyinVCharType;
import net.sourceforge.pinyin4j.format.exception.BadHanyuPinyinOutputFormatCombination;
import net.sourceforge.pinyin4j.multipinyin.Trie;
import org.springframework.core.io.ClassPathResource;
import org.springframework.util.StringUtils;import java.io.BufferedInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;public class PinyinHelper {/*** 自定义拼音扩展库,从classpath下查找*/private static final String MULTI_PINYIN_APPENDER = "multi_pinyin_appender.txt";public static String toHanYuPinyinString(String str, HanyuPinyinOutputFormat outputFormat, String separate, boolean retain) throws BadHanyuPinyinOutputFormatCombination {ChineseToPinyinResource resource = ChineseToPinyinResource.getInstance();StringBuilder resultPinyinStrBuf = new StringBuilder();char[] chars = str.toCharArray();for (int i = 0; i < chars.length; i++) {// 匹配到的最长的结果String result = null;char ch = chars[i];Trie currentTrie = resource.getUnicodeToHanyuPinyinTable();int success = i;int current = i;do {String hexStr = Integer.toHexString((int) ch).toUpperCase();currentTrie = currentTrie.get(hexStr);if (currentTrie != null) {if (currentTrie.getPinyin() != null) {result = currentTrie.getPinyin();success = current;}currentTrie = currentTrie.getNextTire();} else {}current++;if (current < chars.length) {ch = chars[current];} else {break;}} while (currentTrie != null);// 如果在前缀树中没有匹配到,那么它就不能转换为拼音,直接输出或者去掉if (result == null) {if (retain) {if (i != 0 && current != success && resultPinyinStrBuf.lastIndexOf(separate) != resultPinyinStrBuf.length() - 1) {resultPinyinStrBuf.append(separate);}resultPinyinStrBuf.append(chars[i]);}} else {String[] pinyinStrArray = resource.parsePinyinString(result);if (pinyinStrArray != null) {for (int j = 0; j < pinyinStrArray.length; j++) {if (i != 0 && current != success && resultPinyinStrBuf.lastIndexOf(separate) != resultPinyinStrBuf.length() - 1) {resultPinyinStrBuf.append(separate);}resultPinyinStrBuf.append(PinyinFormatter.formatHanyuPinyin(pinyinStrArray[j], outputFormat));// 不是最后一个,(也不是拼音的最后一个,并且不是最后匹配成功的)if (current < chars.length || (j < pinyinStrArray.length - 1 && i != success)) {resultPinyinStrBuf.append(separate);}if (i == success) {break;}}}}i = success;}return resultPinyinStrBuf.toString();}static class PinyinFormatter {static String formatHanyuPinyin(String pinyinStr, HanyuPinyinOutputFormat outputFormat)throws BadHanyuPinyinOutputFormatCombination {if ((HanyuPinyinToneType.WITH_TONE_MARK == outputFormat.getToneType())&& ((HanyuPinyinVCharType.WITH_V == outputFormat.getVCharType()) || (HanyuPinyinVCharType.WITH_U_AND_COLON == outputFormat.getVCharType()))) {throw new BadHanyuPinyinOutputFormatCombination("tone marks cannot be added to v or u:");}if (HanyuPinyinToneType.WITHOUT_TONE == outputFormat.getToneType()) {pinyinStr = pinyinStr.replaceAll("[1-5]", "");} else if (HanyuPinyinToneType.WITH_TONE_MARK == outputFormat.getToneType()) {pinyinStr = pinyinStr.replaceAll("u:", "v");pinyinStr = convertToneNumber2ToneMark(pinyinStr);}if (HanyuPinyinVCharType.WITH_V == outputFormat.getVCharType()) {pinyinStr = pinyinStr.replaceAll("u:", "v");} else if (HanyuPinyinVCharType.WITH_U_UNICODE == outputFormat.getVCharType()) {pinyinStr = pinyinStr.replaceAll("u:", "ü");}if (HanyuPinyinCaseType.UPPERCASE == outputFormat.getCaseType()) {pinyinStr = pinyinStr.toUpperCase();}return pinyinStr;}/*** Convert tone numbers to tone marks using Unicode <br/><br/>** <b>Algorithm for determining location of tone mark</b><br/>* <p>* A simple algorithm for determining the vowel on which the tone mark* appears is as follows:<br/>** <ol>* <li>First, look for an "a" or an "e". If either vowel appears, it takes* the tone mark. There are no possible pinyin syllables that contain both* an "a" and an "e".** <li>If there is no "a" or "e", look for an "ou". If "ou" appears, then* the "o" takes the tone mark.** <li>If none of the above cases hold, then the last vowel in the syllable* takes the tone mark.** </ol>** @param pinyinStr the ascii represention with tone numbers* @return the unicode represention with tone marks*/private static String convertToneNumber2ToneMark(final String pinyinStr) {String lowerCasePinyinStr = pinyinStr.toLowerCase();if (lowerCasePinyinStr.matches("[a-z]*[1-5]?")) {final char defautlCharValue = '$';final int defautlIndexValue = -1;char unmarkedVowel = defautlCharValue;int indexOfUnmarkedVowel = defautlIndexValue;final char charA = 'a';final char charE = 'e';final String ouStr = "ou";final String allUnmarkedVowelStr = "aeiouv";final String allMarkedVowelStr = "āáăàaēéĕèeīíĭìiōóŏòoūúŭùuǖǘǚǜü";if (lowerCasePinyinStr.matches("[a-z]*[1-5]")) {int tuneNumber =Character.getNumericValue(lowerCasePinyinStr.charAt(lowerCasePinyinStr.length() - 1));int indexOfA = lowerCasePinyinStr.indexOf(charA);int indexOfE = lowerCasePinyinStr.indexOf(charE);int ouIndex = lowerCasePinyinStr.indexOf(ouStr);if (-1 != indexOfA) {indexOfUnmarkedVowel = indexOfA;unmarkedVowel = charA;} else if (-1 != indexOfE) {indexOfUnmarkedVowel = indexOfE;unmarkedVowel = charE;} else if (-1 != ouIndex) {indexOfUnmarkedVowel = ouIndex;unmarkedVowel = ouStr.charAt(0);} else {for (int i = lowerCasePinyinStr.length() - 1; i >= 0; i--) {if (String.valueOf(lowerCasePinyinStr.charAt(i)).matches("[" + allUnmarkedVowelStr + "]")) {indexOfUnmarkedVowel = i;unmarkedVowel = lowerCasePinyinStr.charAt(i);break;}}}if ((defautlCharValue != unmarkedVowel) && (defautlIndexValue != indexOfUnmarkedVowel)) {int rowIndex = allUnmarkedVowelStr.indexOf(unmarkedVowel);int columnIndex = tuneNumber - 1;int vowelLocation = rowIndex * 5 + columnIndex;char markedVowel = allMarkedVowelStr.charAt(vowelLocation);return lowerCasePinyinStr.substring(0, indexOfUnmarkedVowel).replaceAll("v", "ü")+ markedVowel+ lowerCasePinyinStr.substring(indexOfUnmarkedVowel + 1,lowerCasePinyinStr.length() - 1).replaceAll("v", "ü");} else// error happens in the procedure of locating vowel{return lowerCasePinyinStr;}} else// input string has no any tune number{// only replace v with ü (umlat) characterreturn lowerCasePinyinStr.replaceAll("v", "ü");}} else// bad format{return lowerCasePinyinStr;}}}static class ChineseToPinyinResource {/*** A hash table contains <Unicode, HanyuPinyin> pairs*/private Trie unicodeToHanyuPinyinTable = null;/*** @param unicodeToHanyuPinyinTable The unicodeToHanyuPinyinTable to set.*/private void setUnicodeToHanyuPinyinTable(Trie unicodeToHanyuPinyinTable) {this.unicodeToHanyuPinyinTable = unicodeToHanyuPinyinTable;}/*** @return Returns the unicodeToHanyuPinyinTable.*/Trie getUnicodeToHanyuPinyinTable() {return unicodeToHanyuPinyinTable;}/*** Private constructor as part of the singleton pattern.*/private ChineseToPinyinResource() {initializeResource();}/*** Initialize a hash-table contains <Unicode, HanyuPinyin> pairs*/private void initializeResource() {try {final String resourceName = "/pinyindb/unicode_to_hanyu_pinyin.txt";final String resourceMultiName = "/pinyindb/multi_pinyin.txt";setUnicodeToHanyuPinyinTable(new Trie());getUnicodeToHanyuPinyinTable().load(ResourceHelper.getResourceInputStream(resourceName));getUnicodeToHanyuPinyinTable().loadMultiPinyin(ResourceHelper.getResourceInputStream(resourceMultiName));// 新增classpath下拼音扩展库if (StringUtils.hasLength(MULTI_PINYIN_APPENDER)) {ClassPathResource pathResource = new ClassPathResource(MULTI_PINYIN_APPENDER);if (pathResource.exists()) {getUnicodeToHanyuPinyinTable().loadMultiPinyin(pathResource.getInputStream());}}// 原始拼音扩展库,仅支持绝对路径getUnicodeToHanyuPinyinTable().loadMultiPinyinExtend();} catch (FileNotFoundException ex) {ex.printStackTrace();} catch (IOException ex) {ex.printStackTrace();}}Trie getHanyuPinyinTrie(char ch) {String codepointHexStr = Integer.toHexString((int) ch).toUpperCase();// fetch from hashtablereturn getUnicodeToHanyuPinyinTable().get(codepointHexStr);}/*** Get the unformatted Hanyu Pinyin representations of the given Chinese* character in array format.** @param ch given Chinese character in Unicode* @return The Hanyu Pinyin strings of the given Chinese character in array* format; return null if there is no corresponding Pinyin string.*/String[] getHanyuPinyinStringArray(char ch) {String pinyinRecord = getHanyuPinyinRecordFromChar(ch);return parsePinyinString(pinyinRecord);}String[] parsePinyinString(String pinyinRecord) {if (null != pinyinRecord) {int indexOfLeftBracket = pinyinRecord.indexOf(Field.LEFT_BRACKET);int indexOfRightBracket = pinyinRecord.lastIndexOf(Field.RIGHT_BRACKET);String stripedString =pinyinRecord.substring(indexOfLeftBracket + Field.LEFT_BRACKET.length(),indexOfRightBracket);return stripedString.split(Field.COMMA);} else {// no record found or mal-formatted recordreturn null;}}/*** @param record given record string of Hanyu Pinyin* @return return true if record is not null and record is not "none0" and* record is not mal-formatted, else return false*/private boolean isValidRecord(String record) {final String noneStr = "(none0)";return (null != record) && !record.equals(noneStr) && record.startsWith(Field.LEFT_BRACKET)&& record.endsWith(Field.RIGHT_BRACKET);}/*** @param ch given Chinese character in Unicode* @return corresponding Hanyu Pinyin Record in Properties file; null if no* record found*/private String getHanyuPinyinRecordFromChar(char ch) {// convert Chinese character to code point (integer)// please refer to http://www.unicode.org/glossary/#code_point// Another reference: http://en.wikipedia.org/wiki/Unicodeint codePointOfChar = ch;String codepointHexStr = Integer.toHexString(codePointOfChar).toUpperCase();// fetch from hashtableTrie trie = getUnicodeToHanyuPinyinTable().get(codepointHexStr);String foundRecord = null;if (trie != null) {foundRecord = trie.getPinyin();}return isValidRecord(foundRecord) ? foundRecord : null;}/*** Singleton factory method.** @return the one and only MySingleton.*/static ChineseToPinyinResource getInstance() {return ChineseToPinyinResourceHolder.THE_INSTANCE;}/*** Singleton implementation helper.*/private static class ChineseToPinyinResourceHolder {static final ChineseToPinyinResource THE_INSTANCE = new ChineseToPinyinResource();}/*** A class encloses common string constants used in Properties files** @author Li Min (xmlerlimin@gmail.com)*/class Field {static final String LEFT_BRACKET = "(";static final String RIGHT_BRACKET = ")";static final String COMMA = ",";}}static class ResourceHelper {/*** @param resourceName* @return resource (mainly file in file system or file in compressed* package) as BufferedInputStream*/static BufferedInputStream getResourceInputStream(String resourceName) {return new BufferedInputStream(ResourceHelper.class.getResourceAsStream(resourceName));}}
}

下面是使用方式: 里面用到了google的guava包的部分内容

import com.google.common.base.Splitter;
import com.google.common.collect.Lists;
import net.sourceforge.pinyin4j.format.HanyuPinyinOutputFormat;
import net.sourceforge.pinyin4j.format.HanyuPinyinToneType;
import net.sourceforge.pinyin4j.format.HanyuPinyinVCharType;
import net.sourceforge.pinyin4j.format.exception.BadHanyuPinyinOutputFormatCombination;import java.util.List;/*** 拼音工具类*/
public class PinyinUtil {private static HanyuPinyinOutputFormat outputFormat;private static final String SEPARATE = "#";static {outputFormat = new HanyuPinyinOutputFormat();outputFormat.setVCharType(HanyuPinyinVCharType.WITH_V);outputFormat.setToneType(HanyuPinyinToneType.WITHOUT_TONE);}/*** 获取文本的拼音** @param str     需要转换拼音的文本* @param retain  true:保留中文以外的其他字符* @param initial true:只需要首字母* @return 拼音*/public static String toPinYinString(String str, boolean retain, boolean initial) {StringBuilder sb = new StringBuilder();try {List<String> list = Lists.newArrayList();StringBuilder notChinese = new StringBuilder();for (int i = 0; i < str.length(); i++) {if (str.charAt(i) < 0x4E00 || str.charAt(i) > 0x9FA5) {notChinese.append(str.charAt(i));if (i == str.length() - 1) {list.add(notChinese.toString());}} else {if (notChinese.length() > 0) {list.add(notChinese.toString());notChinese = new StringBuilder();}}}String pinyin = PinyinHelper.toHanYuPinyinString(str, outputFormat, SEPARATE, retain);Splitter.on(SEPARATE).split(pinyin).forEach(py -> {if (list.contains(py)) {sb.append(py);return;}if (initial) {if (py.length() > 0) {sb.append(py.charAt(0));}} else {sb.append(py);}});} catch (BadHanyuPinyinOutputFormatCombination e) {e.printStackTrace();}return sb.toString();}}

下面是临时测试结果:

		String str = "成长,重启,重量,长大了,角色,角落,呼啦啦,1我2,3爱4,5你6";System.out.println(PinyinUtil.toPinYinString(str, true, true));// cz,cq,zl,zdl,js,jl,hll,1w2,3a4,5n6System.out.println(PinyinUtil.toPinYinString(str, false, true));// czcqzlzdljsjlhllwanSystem.out.println(PinyinUtil.toPinYinString(str, true, false));// chengzhang,chongqi,zhongliang,zhangdale,juese,jiaoluo,hulala,1wo2,3ai4,5ni6System.out.println(PinyinUtil.toPinYinString(str, false, false));// chengzhangchongqizhongliangzhangdalejuesejiaoluohulalawoaini

这篇关于pinyin4j获取多音字首字母同时保留非中文字符的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/988965

相关文章

SpringBoot整合mybatisPlus实现批量插入并获取ID详解

《SpringBoot整合mybatisPlus实现批量插入并获取ID详解》这篇文章主要为大家详细介绍了SpringBoot如何整合mybatisPlus实现批量插入并获取ID,文中的示例代码讲解详细... 目录【1】saveBATch(一万条数据总耗时:2478ms)【2】集合方式foreach(一万条数

python获取网页表格的多种方法汇总

《python获取网页表格的多种方法汇总》我们在网页上看到很多的表格,如果要获取里面的数据或者转化成其他格式,就需要将表格获取下来并进行整理,在Python中,获取网页表格的方法有多种,下面就跟随小编... 目录1. 使用Pandas的read_html2. 使用BeautifulSoup和pandas3.

SpringBoot UserAgentUtils获取用户浏览器的用法

《SpringBootUserAgentUtils获取用户浏览器的用法》UserAgentUtils是于处理用户代理(User-Agent)字符串的工具类,一般用于解析和处理浏览器、操作系统以及设备... 目录介绍效果图依赖封装客户端工具封装IP工具实体类获取设备信息入库介绍UserAgentUtils

C# foreach 循环中获取索引的实现方式

《C#foreach循环中获取索引的实现方式》:本文主要介绍C#foreach循环中获取索引的实现方式,本文给大家介绍的非常详细,对大家的学习或工作具有一定的参考借鉴价值,需要的朋友参考下吧... 目录一、手动维护索引变量二、LINQ Select + 元组解构三、扩展方法封装索引四、使用 for 循环替代

Linux下如何使用C++获取硬件信息

《Linux下如何使用C++获取硬件信息》这篇文章主要为大家详细介绍了如何使用C++实现获取CPU,主板,磁盘,BIOS信息等硬件信息,文中的示例代码讲解详细,感兴趣的小伙伴可以了解下... 目录方法获取CPU信息:读取"/proc/cpuinfo"文件获取磁盘信息:读取"/proc/diskstats"文

Vue3组件中getCurrentInstance()获取App实例,但是返回null的解决方案

《Vue3组件中getCurrentInstance()获取App实例,但是返回null的解决方案》:本文主要介绍Vue3组件中getCurrentInstance()获取App实例,但是返回nu... 目录vue3组件中getCurrentInstajavascriptnce()获取App实例,但是返回n

SpringMVC获取请求参数的方法

《SpringMVC获取请求参数的方法》:本文主要介绍SpringMVC获取请求参数的方法,本文通过实例代码给大家介绍的非常详细,对大家的学习或工作具有一定的参考借鉴价值,需要的朋友可以参考下... 目录1、通过ServletAPI获取2、通过控制器方法的形参获取请求参数3、@RequestParam4、@

Python获取C++中返回的char*字段的两种思路

《Python获取C++中返回的char*字段的两种思路》有时候需要获取C++函数中返回来的不定长的char*字符串,本文小编为大家找到了两种解决问题的思路,感兴趣的小伙伴可以跟随小编一起学习一下... 有时候需要获取C++函数中返回来的不定长的char*字符串,目前我找到两种解决问题的思路,具体实现如下:

golang获取当前时间、时间戳和时间字符串及它们之间的相互转换方法

《golang获取当前时间、时间戳和时间字符串及它们之间的相互转换方法》:本文主要介绍golang获取当前时间、时间戳和时间字符串及它们之间的相互转换,本文通过实例代码给大家介绍的非常详细,感兴趣... 目录1、获取当前时间2、获取当前时间戳3、获取当前时间的字符串格式4、它们之间的相互转化上篇文章给大家介

Python获取中国节假日数据记录入JSON文件

《Python获取中国节假日数据记录入JSON文件》项目系统内置的日历应用为了提升用户体验,特别设置了在调休日期显示“休”的UI图标功能,那么问题是这些调休数据从哪里来呢?我尝试一种更为智能的方法:P... 目录节假日数据获取存入jsON文件节假日数据读取封装完整代码项目系统内置的日历应用为了提升用户体验,