转载

挖掘数据流 Mining Data Stream

数据流 挖掘:

  1. 输入元组通过一个或多个端口快速地进入
  2. 系统无法存储整个 数据流
  3. 在有限内存的条件下对 数据流 做出一些计算

对数据流的两类Query:

  1. Ad-hoc queries,例如当前数据流中的最大值是多少
  2. standing queries, 例如每当数据流中出现了新的最大值就汇报一次

滑动窗口:

  • Query时总是针对数据流中固定长度N的一个窗口(最新进入系统的N个数据)
  • 或者:最近T时间内进入系统的数据

当窗口的长度大到无法存储于主存内时,或者有太多不同数据流,内存不够存储多个窗口,则需要更有趣的方法。

例子:平均数

  • 数据流为一个个整数
  • 窗口尺寸 挖掘数据流 Mining Data Stream
  • standing query: 当前窗口中数字的平均值为多少?

方法:

  1. 对于前 挖掘数据流 Mining Data Stream 个数据,计算其平均值 挖掘数据流 Mining Data Stream
  2. 对于每一个新来的数据 挖掘数据流 Mining Data Stream ,找到当前窗口中最旧的数据 挖掘数据流 Mining Data Stream ,更新平均值为 挖掘数据流 Mining Data Stream

Python

import random # sliding window average: stream = [random.random() for i in range(99)] def slidingWindowAverage(stream, windowSize = 10):  window = stream[:windowSize]  avg = mean(window)  index = windowSize  while True:   option = raw_input('n for new input, e for exit: ')   if option == 'n':    j = window.pop(0)    i = stream[index]    window.append(i)    index += 1    avg += float(i-j)/windowSize    print 'current average value is {0}'.format(avg)   if index >= len(stream) or option == 'e':    break  return avg avg = slidingWindowAverage(stream) n for new input, e for exit: n current average value is 0.524461680244 n for new input, e for exit: n current average value is 0.530545388643 n for new input, e for exit: n current average value is 0.540916392055 n for new input, e for exit: n current average value is 0.502333068095 n for new input, e for exit: e 

Counting Bits 问题:

  • 给定一个数据为 挖掘数据流 Mining Data Stream 挖掘数据流 Mining Data Stream 的数据流,希望回答的问题是:最近k个数据中有多少个 挖掘数据流 Mining Data Stream 挖掘数据流 Mining Data Stream
  • 如果存储最近的 挖掘数据流 Mining Data Stream 个数据,很容易回答。
  • 假设 挖掘数据流 Mining Data Stream 均很大,并且有非常多数据流同时流入并且需要计算,无法这样做。

DGIM 算法:

  • 对每一个数据进行timestamp
  • 对窗口内的数据进行summary为一个个bucket
  • 每个bucket代表窗口内报括了 挖掘数据流 Mining Data Stream 挖掘数据流 Mining Data Stream 的一段数据流,记录为(该段的末尾位置, 挖掘数据流 Mining Data Stream 挖掘数据流 Mining Data Stream 为该bucket的大小
  • 如果同样大小的bucket出现了 挖掘数据流 Mining Data Stream 次,就进行合并,丢掉时间戳较老的,更新较新的那个味(该段的末尾位置, 挖掘数据流 Mining Data Stream )

挖掘数据流 Mining Data Stream

Query,最近 挖掘数据流 Mining Data Stream

个数据中有多少个

挖掘数据流 Mining Data Stream

  • 只看结尾时间在最近k时间以内的bucket
  • 将所有bucket的大小加起来,最旧的一个bucket的值只算一半
  • 这样的误差,最大可能会有 挖掘数据流 Mining Data Stream

Python

def dgimTimeStamp(stream, windowSize, t = 0, buckets = dict()):  # input n to add a new bit  # input e to exit program  # input a number ot query  sizes = [2**i for i in range(int(round(math.log(windowSize,2))))]  while True:   option = raw_input('e to exit; n to add 1 bit; any number to query  ')   if option == 'n':    # remove too old buckets    timestamp = t%windowSize    for key in buckets.keys():     if key == timestamp:      buckets.pop(key)    # adding a new bit in    if stream[t] == 1:     buckets[timestamp] = 1    t += 1    # buckets cascading updating    for size in sizes:     if sum(np.array(buckets.values()) == size) == 3:      keys = [b for b in buckets.keys() if buckets[b] == size]      for i in range(len(keys)):       if keys[i] <= timestamp:        keys[i] += windowSize      keys.sort()      tbr = keys[0]      if tbr < windowSize:       buckets.pop(tbr)      else:       buckets.pop(tbr - windowSize)      tbm = keys[1]      if tbm < windowSize:       buckets[tbm] *= 2      else:       buckets[tbm-windowSize] *= 2    if t >= len(stream):     break   elif option == 'e':    break   else:    k = int(option)    keys = [i%windowSize for i in range(t-k,t)]    initial = True    s = 0    for key in keys:     if key in buckets and initial == False:      s += buckets[key]     if key in buckets and initial == True:      s += buckets[key]*0.5      initial =False    print s, t   print buckets  return buckets, t stream = [1 for i in range(99)] b, t = dgimTimeStamp(stream, 5) e to exit; n to add 1 bit; any number to query  n {0: 1} e to exit; n to add 1 bit; any number to query  n {0: 1, 1: 1} e to exit; n to add 1 bit; any number to query  1 0.5 2 {0: 1, 1: 1} e to exit; n to add 1 bit; any number to query  2 1.5 2 {0: 1, 1: 1} e to exit; n to add 1 bit; any number to query  n {1: 2, 2: 1} e to exit; n to add 1 bit; any number to query  n {1: 2, 2: 1, 3: 1} e to exit; n to add 1 bit; any number to query  n {1: 2, 3: 2, 4: 1} e to exit; n to add 1 bit; any number to query  n {0: 1, 1: 2, 3: 2, 4: 1} e to exit; n to add 1 bit; any number to query  n {0: 2, 1: 1, 3: 2} e to exit; n to add 1 bit; any number to query  2 2.0 7 {0: 2, 1: 1, 3: 2} e to exit; n to add 1 bit; any number to query  4 4.0 7 {0: 2, 1: 1, 3: 2} e to exit; n to add 1 bit; any number to query  e 
正文到此结束
Loading...