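The performance tests in the previous post showed some patterns in elapsed time and memory usage. For open-source software, though, the most efficient way to explain such behavior is to read the source. Below we go into the ICU source code for a brief look at how word segmentation is implemented.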
Take the call used in our example, which obtains the WordInstance for the current default Locale:
    java.text.BreakIterator.getWordInstance()
    -> java.text.BreakIterator.getWordInstance(Locale.getDefault())
    -> java.text.IcuIteratorWrapper new IcuIteratorWrapper()
    -> android.icu.text.BreakIterator.getWordInstance(ULocale where)
    -> android.icu.text.BreakIterator.getBreakInstance(where, KIND_WORD)

This brings us to the first key piece of logic, in android.icu.text.BreakIterator:
    /**
     * Returns a particular kind of BreakIterator for a locale.
     * Avoids writing a switch statement with getXYZInstance(where) calls.
     * @internal
     * @deprecated This API is ICU internal only.
     */
    @Deprecated
    public static BreakIterator getBreakInstance(ULocale where, int kind) {
        if (where == null) {
            throw new NullPointerException("Specified locale is null");
        }
        if (iterCache[kind] != null) {
            BreakIteratorCache cache = (BreakIteratorCache)iterCache[kind].get();
            if (cache != null) {
                if (cache.getLocale().equals(where)) {
                    return cache.createBreakInstance();
                }
            }
        }

        // sigh, all to avoid linking in ICULocaleData...
        BreakIterator result = getShim().createBreakIterator(where, kind);

        BreakIteratorCache cache = new BreakIteratorCache(where, result);
        iterCache[kind] = new SoftReference<BreakIteratorCache>(cache);
        if (result instanceof RuleBasedBreakIterator) {
            RuleBasedBreakIterator rbbi = (RuleBasedBreakIterator)result;
            rbbi.setBreakType(kind);
        }

        return result;
    }

The initialization logic above shows several things:

(1) A soft-reference array, iterCache, serves as the cache:
    /**
     * {@icu}
     * @stable ICU 2.4
     */
    public static final int KIND_CHARACTER = 0;

    /**
     * {@icu}
     * @stable ICU 2.4
     */
    public static final int KIND_WORD = 1;

    /**
     * {@icu}
     * @stable ICU 2.4
     */
    public static final int KIND_LINE = 2;

    /**
     * {@icu}
     * @stable ICU 2.4
     */
    public static final int KIND_SENTENCE = 3;

    /**
     * {@icu}
     * @stable ICU 2.4
     */
    public static final int KIND_TITLE = 4;

    /**
     * @since ICU 2.8
     */
    private static final int KIND_COUNT = 5;

    private static final SoftReference<?>[] iterCache = new SoftReference<?>[5];

As we know, a softly referenced object remains usable while the VM has plenty of heap. When memory runs short the GC may reclaim it, and all soft references are guaranteed to be cleared before the VM throws OutOfMemoryError. The cache therefore has one slot per kind, and any slot can become empty at any time.

(2) The cached object is a BreakIteratorCache, a private static nested class of BreakIterator:
    private static final class BreakIteratorCache {

        private BreakIterator iter;
        private ULocale where;

        BreakIteratorCache(ULocale where, BreakIterator iter) {
            this.where = where;
            this.iter = (BreakIterator) iter.clone();
        }

        ULocale getLocale() {
            return where;
        }

        BreakIterator createBreakInstance() {
            return (BreakIterator) iter.clone();
        }
    }

(3) Cache miss. The following code creates and stores a cache object:
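    BreakIterator result = getShim().createBreakIterator(where, kind);

    BreakIteratorCache cache = new BreakIteratorCache(where, result);
    iterCache[kind] = new SoftReference<BreakIteratorCache>(cache);

As the source of the cache class shows, the iter field of the cache holds a clone of the BreakIterator obtained from getShim().createBreakIterator(), because the constructor calls iter.clone(). The original object is returned directly to the caller of the initialization method, so later changes the caller makes to it do not affect the cached copy.

(4) Cache hit. If both kind and Locale match, the cached BreakIteratorCache object is used: BreakIteratorCache.createBreakInstance(), which is simply (BreakIterator) iter.clone(), clones a BreakIterator and returns it.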
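The hit and miss paths described in points (1) to (4) can be tried out in isolation. What follows is a minimal sketch, not ICU code: Prototype, SingleSlotCache, the builds counter and simulateGcClear are hypothetical names introduced for illustration, with Prototype standing in for BreakIterator. It assumes only what the quoted source shows: one soft-referenced slot per kind, a locale check, and a clone on both store and hit.

```java
import java.lang.ref.SoftReference;
import java.util.Locale;

// Hypothetical stand-in for BreakIterator: cloneable and (pretend) costly to build.
class Prototype implements Cloneable {
    final Locale locale;

    Prototype(Locale locale) { this.locale = locale; }

    @Override
    public Prototype clone() {
        try {
            return (Prototype) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class is Cloneable
        }
    }
}

// One soft-referenced slot per kind; each slot remembers a single locale.
class SingleSlotCache {
    private static final class Entry {
        final Locale locale;
        final Prototype proto;

        Entry(Locale locale, Prototype created) {
            this.locale = locale;
            this.proto = created.clone(); // private copy; the caller keeps 'created'
        }
    }

    private final SoftReference<?>[] slots;
    int builds = 0; // counts trips through the expensive path, for illustration only

    SingleSlotCache(int kinds) { slots = new SoftReference<?>[kinds]; }

    Prototype get(Locale where, int kind) {
        SoftReference<?> ref = slots[kind];
        if (ref != null) {
            Entry e = (Entry) ref.get(); // null once the reference has been cleared
            if (e != null && e.locale.equals(where)) {
                return e.proto.clone(); // hit: hand out a fresh clone
            }
        }
        builds++;
        Prototype result = new Prototype(where); // miss: build from scratch
        slots[kind] = new SoftReference<Entry>(new Entry(where, result));
        return result; // the original, not the cached copy
    }

    // Hypothetical test hook that mimics the GC clearing a soft reference.
    void simulateGcClear(int kind) {
        if (slots[kind] != null) {
            slots[kind].clear();
        }
    }
}

class SingleSlotCacheDemo {
    public static void main(String[] args) {
        SingleSlotCache cache = new SingleSlotCache(5);
        Prototype first = cache.get(Locale.US, 1);  // miss
        Prototype second = cache.get(Locale.US, 1); // hit, returns a clone
        System.out.println("distinct objects: " + (first != second));
        cache.get(Locale.FRANCE, 1);                // miss, replaces the US entry
        cache.get(Locale.US, 1);                    // miss again
        cache.simulateGcClear(1);
        cache.get(Locale.US, 1);                    // miss, the slot was cleared
        System.out.println("expensive builds: " + cache.builds);
    }
}
```

One consequence is easy to see in the sketch: because each kind has a single slot, alternating between two locales replaces the entry on every call, so every call takes the expensive path.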
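At this point the code actually raises quite a few questions: why is it written this way? Keeping those questions in mind, let's continue. Next comes the substantive part of the initialization: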
    getShim().createBreakIterator(where, kind);

First, getShim():
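    private static BreakIteratorServiceShim shim;
    private static BreakIteratorServiceShim getShim() {
        // Note: this instantiation is safe on loose-memory-model configurations
        // despite lack of synchronization, since the shim instance has no state--
        // it's all in the class init.  The worst problem is we might instantiate
        // two shim instances, but they'll share the same state so that's ok.
        if (shim == null) {
            try {
                Class<?> cls = Class.forName("com.ibm.icu.text.BreakIteratorFactory");
                shim = (BreakIteratorServiceShim)cls.newInstance();
            }
            catch (MissingResourceException e)
            {
                throw e;
            }
            catch (Exception e) {
                ///CLOVER:OFF
                if(DEBUG){
                    e.printStackTrace();
                }
                throw new RuntimeException(e.getMessage());
                ///CLOVER:ON
            }
        }
        return shim;
    }

This is lazy initialization with two notable traits: first, there is no synchronization to make it thread-safe; second, it uses reflection to look up the class BreakIteratorFactory and create the instance.

The comment already explains the missing synchronization: the initialization logic lives in class loading (more precisely, in class initialization), the object itself is stateless, and all state is held in the class, so even in the worst case, where more than one instance is created, no harm is done. In other words, the key logic is placed in class initialization, which the JVM runs at most once per class under its own lock, and thread safety is obtained indirectly.

Why use reflection when BreakIteratorFactory sits in the same package? Because the initialization logic lives in the class initialization of BreakIteratorFactory and has to stay lazy, so the class must not be loaded before the first call to getShim(). The JVM specification imposes no hard rule on when a class is loaded; it only requires that loading is complete by the time the class is used (by which point class initialization has completed as well). Different JVMs are therefore free to choose the loading time according to their own strategy, and the class might be preloaded, which would defeat the lazy initialization. (Strictly speaking, what an implementation may do early is loading and linking; static initializers still run only at first active use. Naming the class only by a string rules out early loading too.) With reflection, the JVM has no advance knowledge that the class will be needed, so it is loaded only when getShim() runs. The fact that the IDE reports BreakIteratorFactory as "never used" is supporting evidence.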
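The lazy, reflective lookup can be reproduced in a few lines. This is a minimal sketch under stated assumptions, not ICU code: InitLog, Shim, ShimHolder and HeavyFactory are hypothetical names, HeavyFactory plays the role of BreakIteratorFactory, and the log exists only to make the moment of class initialization observable. The factory is a nested class purely so that its name can be derived at run time; getDeclaredConstructor().newInstance() is used in place of the deprecated Class.newInstance().

```java
import java.util.ArrayList;
import java.util.List;

// Records class-initialization events so the demo can observe them.
class InitLog {
    static final List<String> events = new ArrayList<>();
}

// Stand-in for BreakIteratorServiceShim.
interface Shim {
    String create(String locale, int kind);
}

class ShimHolder {
    // Stand-in for BreakIteratorFactory: all shared state is set up in the
    // static initializer, and instances carry no state of their own.
    static class HeavyFactory implements Shim {
        static final String[] KIND_NAMES;

        static {
            KIND_NAMES = new String[] {"grapheme", "word", "line", "sentence", "title"};
            InitLog.events.add("HeavyFactory initialized");
        }

        @Override
        public String create(String locale, int kind) {
            return KIND_NAMES[kind] + "@" + locale;
        }
    }

    private static Shim shim;

    // Unsynchronized on purpose, mirroring the quoted code: a race could create
    // two instances, but both would share the same class-level state.
    static Shim getShim() {
        if (shim == null) {
            try {
                // The class is named only by a string, so nothing in this method
                // refers to it statically. Class.forName also initializes it.
                Class<?> cls = Class.forName(ShimHolder.class.getName() + "$HeavyFactory");
                shim = (Shim) cls.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        }
        return shim;
    }
}

class LazyShimDemo {
    public static void main(String[] args) {
        System.out.println("events before getShim(): " + InitLog.events.size());
        Shim s = ShimHolder.getShim();
        System.out.println("events after getShim(): " + InitLog.events.size());
        System.out.println(s.create("th", 1));
    }
}
```

The static initializer runs once, on the first getShim() call, and repeated calls return the same instance without touching the class again.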
Next, BreakIteratorFactory. What does its class initialization do?

(1) It initializes a constant array:
    /** KIND_NAMES are the resource key to be used to fetch the name of the
     *  pre-compiled break rules.  The resource bundle name is "boundaries".
     *  The value for each key will be the rules to be used for the
     *  specified locale - "word" -> "word_th" for Thai, for example.
     */
    private static final String[] KIND_NAMES = {
        "grapheme", "word", "line", "sentence", "title"
    };

(2) It initializes service:
    static final ICULocaleService service = new BFService();

The static member service is important. As analyzed earlier, the substantive code in the initialization is:
    getShim().createBreakIterator(where, kind);

and inside BreakIteratorFactory, service is involved in creating the instance:
    public BreakIterator createBreakIterator(ULocale locale, int kind) {
        // TODO: convert to ULocale when service switches over
        if (service.isDefault()) {
            return createBreakInstance(locale, kind);
        }
        ULocale[] actualLoc = new ULocale[1];
        BreakIterator iter = (BreakIterator)service.get(locale, kind, actualLoc);
        iter.setLocale(actualLoc[0], actualLoc[0]); // services make no distinction between actual & valid
        return iter;
    }

Now look at the private static nested class BFService:
    private static class BFService extends ICULocaleService {
        BFService() {
            super("BreakIterator");

            class RBBreakIteratorFactory extends ICUResourceBundleFactory {
                protected Object handleCreate(ULocale loc, int kind, ICUService srvc) {
                    return createBreakInstance(loc, kind);
                }
            }
            registerFactory(new RBBreakIteratorFactory());

            markDefault();
        }

        /**
         * createBreakInstance() returns an appropriate BreakIterator for any locale.
         * It falls back to root if there is no specific data.
         *
         * <p>Without this override, the service code would fall back to the default locale
         * which is not desirable for an algorithm with a good Unicode default,
         * like break iteration.
         */
        @Override
        public String validateFallbackLocale() {
            return "";
        }
    }

In its constructor, an instance of the local class RBBreakIteratorFactory, which wraps BreakIteratorFactory.createBreakInstance(), is registered in the service's own collection of factories and marked as the default. As long as no other factory instance is registered, BreakIteratorFactory.createBreakInstance() is therefore used by default; see the BreakIteratorFactory.createBreakIterator() method shown earlier.

One more note: the "object itself is stateless, all state is in the class" remark in the comment quoted earlier most likely refers to service.
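The interplay between registerFactory(), markDefault() and isDefault() can be illustrated with a toy service. This is a rough sketch and not the real ICUService API: Factory, SimpleService and IteratorFactory are hypothetical, iterators are replaced by strings, and the way isDefault() is computed here (comparing the factory count with the count recorded by markDefault()) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical factory contract: return null to let the next factory try.
interface Factory {
    String create(String locale, int kind);
}

// Toy service: a list of factories plus a "still in default state" marker.
class SimpleService {
    private final List<Factory> factories = new ArrayList<>();
    private int defaultSize;

    void registerFactory(Factory f) { factories.add(0, f); } // newest first

    void markDefault() { defaultSize = factories.size(); }

    // Assumed rule: default means nothing was registered since markDefault().
    boolean isDefault() { return factories.size() == defaultSize; }

    String get(String locale, int kind) {
        for (Factory f : factories) {
            String result = f.create(locale, kind);
            if (result != null) {
                return result;
            }
        }
        return null;
    }
}

class IteratorFactory {
    static final SimpleService service = new SimpleService();

    static {
        // Mirrors the BFService constructor: wrap the built-in creator, then mark default.
        service.registerFactory(IteratorFactory::createBreakInstance);
        service.markDefault();
    }

    static String createBreakInstance(String locale, int kind) {
        return "builtin:" + kind + "@" + locale;
    }

    static String createBreakIterator(String locale, int kind) {
        if (service.isDefault()) {
            return createBreakInstance(locale, kind); // shortcut, bypasses the lookup
        }
        return service.get(locale, kind);
    }
}

class ServiceDemo {
    public static void main(String[] args) {
        System.out.println(IteratorFactory.createBreakIterator("en", 1));
        // A custom factory that only handles Thai; everything else falls through.
        IteratorFactory.service.registerFactory(
            (loc, kind) -> "th".equals(loc) ? "custom:" + kind + "@th" : null);
        System.out.println(IteratorFactory.createBreakIterator("th", 1));
        System.out.println(IteratorFactory.createBreakIterator("en", 1));
    }
}
```

In this sketch the shortcut in createBreakIterator() is taken only while the service is in its default state; once a custom factory is registered, every request goes through the service lookup, with the built-in factory as the last candidate.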
