Bài giảng môn học Kho dữ liệu và khai phá dữ liệu - Chương 3: Giới thiệu chung về kho dữ liệu

Khái niệm kho dữ liệu

Mô hình dữ liệu đa chiều

Kiến trúc kho dữ liệu

Thi hành kho dữ liệu

Từ xây dựng kho dữ liệu tới KPDL

Sự phát triển mới của công nghệ khối dữ liệu

129 trang | Chia sẻ: Mr Hưng | Lượt xem: 1376 | Lượt tải: 1Free

Bạn đang xem trước 20 trang nội dung tài liệu Bài giảng môn học Kho dữ liệu và khai phá dữ liệu - Chương 3: Giới thiệu chung về kho dữ liệu, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên

m sốSelfExp: chụp các ô liên quan tới các ô khác cùngmức tổng hợpInExp: chụp phía dưới của ôPathExp: chụp phía dưới của ô theo mỗi đường khoan xuốngTính toán chỉ số loại trừ (thiết đặt mô hình và tính các giá trị SelfExp, InExp, and PathExp) có thể gối vào xây dựng khốiLoại từ tự chính có thể được bảo quản, chỉ mục và tìm kiếm như một tổng hợp tính toán trước*Kho dữ liệu và khai phá dữ liệu: Chương 3*Ví dụ: Khối dữ liệu theo phát hiện*Kho dữ liệu và khai phá dữ liệu: Chương 3*Complex Aggregation at Multiple Granularities: Multi-Feature CubesMulti-feature cubes (Ross, et al. 1998): Compute complex queries involving multiple dependent aggregates at multiple granularitiesEx. Grouping by all subsets of {item, region, month}, find the maximum price in 1997 for each group, and the total sales among all maximum price tuplesselect item, region, month, max(price), sum(R.sales)from purchaseswhere year = 1997cube by item, region, month: Rsuch that R.price = max(price)Continuing the last example, among the max price tuples, find the min and max shelf live, and find the fraction of the total sales due to tuple that have min shelf life within the set of all max price tuples*Kho dữ liệu và khai phá dữ liệu: Chương 3*Cube-Gradient (Cubegrade)Analysis of changes of sophisticated measures in multi-dimensional spacesQuery: changes of average house price in Vancouver in ‘00 comparing against ’99Answer: Apts in West went down 20%, houses in Metrotown went up 10%Cubegrade problem by Imielinski et al.Changes in dimensions  changes in measuresDrill-down, roll-up, and mutation*Kho dữ liệu và khai phá dữ liệu: Chương 3*From Cubegrade to Multi-dimensional Constrained Gradients in Data CubesSignificantly more expressive than association rulesCapture trends in user-specified measuresSerious challengesMany trivial cells in a cube  “significance constraint” to prune trivial cellsNumerate pairs of cells  “probe constraint” to select a subset of cells to examineOnly interesting changes wanted “gradient constraint” to capture significant changes*Kho dữ liệu và khai phá dữ liệu: Chương 3*MD Constrained Gradient MiningSignificance constraint Csig: (cnt100)Probe constraint Cprb: (city=“Van”, cust_grp=“busi”, prod_grp=“*”)Gradient constraint Cgrad(cg, cp): (avg_price(cg)/avg_price(cp)1.3)DimensionsMeasurescidYrCityCst_grpPrd_grpCntAvg_pricec100VanBusiPC3002100c2*VanBusiPC28001800c3*TorBusiPC79002350c4**busiPC586002250Base cellAggregated cellSiblingsAncestorProbe cell: satisfied Cprb(c4, c2) satisfies Cgrad!*Kho dữ liệu và khai phá dữ liệu: Chương 3*A LiveSet-Driven AlgorithmCompute probe cells using Csig and CprbThe set of probe cells P is often very smallUse probe P and constraints to find gradientsPushing selection deeplySet-oriented processing for probe cellsIceberg growing from low to high dimensionalitiesDynamic pruning probe cells during growthIncorporating efficient iceberg cubing method*Kho dữ liệu và khai phá dữ liệu: Chương 3*From On-Line Analytical Processing to On Line Analytical Mining (OLAM)Why online analytical mining?High quality of data in data warehousesDW contains integrated, consistent, cleaned dataAvailable information processing structure surrounding data warehousesODBC, OLEDB, Web accessing, service facilities, reporting and OLAP toolsOLAP-based exploratory data analysismining with drilling, dicing, pivoting, etc.On-line selection of data mining functionsintegration and swapping of multiple mining functions, algorithms, and tasks.Architecture of OLAM*Kho dữ liệu và khai phá dữ liệu: Chương 3*OLAP Mining: Integration of Data Mining and Data WarehousingData mining systems, DBMS, Data warehouse systems couplingNo coupling, loose-coupling, semi-tight-coupling, tight-couplingOn-line analytical mining dataintegration of mining and OLAP technologiesInteractive mining multi-level knowledgeNecessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.Integration of multiple mining functions Characterized classification, first clustering and then association*Kho dữ liệu và khai phá dữ liệu: Chương 3*An OLAM ArchitectureData WarehouseMeta DataMDDBOLAMEngineOLAPEngineUser GUI APIData Cube APIDatabase APIData cleaningData integrationLayer3OLAP/OLAMLayer2MDDBLayer1Data RepositoryLayer4User InterfaceFiltering&IntegrationFilteringDatabasesMining queryMining result*Kho dữ liệu và khai phá dữ liệu: Chương 3*Kinds of Exceptions and their Computation01/6/09Parameters (thăm dò có điều khiển) ba chỉ số hấp dẫnSelfExp: surprise of cell relative to other cells at same level of aggregationInExp: surprise beneath the cellPathExp: surprise beneath cell for each drill-down pathComputation of exception indicator (modeling fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube constructionException themselves can be stored, indexed and retrieved like precomputed aggregates*Kho dữ liệu và khai phá dữ liệu: Chương 3*Examples: Discovery-Driven Data Cubesví dụ item, time, region*Kho dữ liệu và khai phá dữ liệu: Chương 3*Complex Aggregation at Multiple Granularities: Multi-Feature CubesMulti-feature cubes (Ross, et al. 1998): Compute complex queries involving multiple dependent aggregates at multiple granularitiesEx.4.17: Grouping by all subsets of {item, region, month}, find the maximum price in 2004 for each group, and the total sales among all maximum price tuples – cú pháp SQL mở rộngselect item, region, month, max(price), sum(R.sales)from purchaseswhere year = 2004cube by item, region, month: R (mở rộng group by)such that R.price = max(price)Continuing the example 4.18, among the max price tuples, find the min and max shelf live, and find the fraction of the total sales due to tuple that have min shelf life within the set of all max price tuples*Kho dữ liệu và khai phá dữ liệu: Chương 3*Cube-Gradient (Cubegrade)Analysis of changes of sophisticated measures in multi-dimensional spacesQuery: changes of average house price in Vancouver in ‘00 comparing against ’99Answer: Apts in West went down 20%, houses in Metrotown went up 10%Cubegrade problem by Imielinski et al.Changes in dimensions  changes in measuresDrill-down, roll-up, and mutation*Kho dữ liệu và khai phá dữ liệu: Chương 3*From Cubegrade to Multi-dimensional Constrained Gradients in Data CubesSignificantly more expressive than association rulesCapture trends in user-specified measuresSerious challengesMany trivial cells in a cube  “significance constraint” to prune trivial cellsNumerate pairs of cells  “probe constraint” to select a subset of cells to examineOnly interesting changes wanted “gradient constraint” to capture significant changes*Kho dữ liệu và khai phá dữ liệu: Chương 3*MD Constrained Gradient MiningSignificance constraint Csig: (cnt100)Probe constraint Cprb: (city=“Van”, cust_grp=“busi”, prod_grp=“*”)Gradient constraint Cgrad(cg, cp): (avg_price(cg)/avg_price(cp)1.3)DimensionsMeasurescidYrCityCst_grpPrd_grpCntAvg_pricec100VanBusiPC3002100c2*VanBusiPC28001800c3*TorBusiPC79002350c4**busiPC586002250Base cellAggregated cellSiblingsAncestorProbe cell: satisfied Cprb(c4, c2) satisfies Cgrad!*Kho dữ liệu và khai phá dữ liệu: Chương 3*A LiveSet-Driven Algorithmsig: ràng buộc dấu, ràng buộ dòCompute probe cells using Csig and CprbThe set of probe cells P is often very smallUse probe P and constraints to find gradientsPushing selection deeplySet-oriented processing for probe cellsIceberg growing from low to high dimensionalitiesDynamic pruning probe cells during growthIncorporating efficient iceberg cubing method*Kho dữ liệu và khai phá dữ liệu: Chương 3*Chương 3: Cơ sở về kho dữ liệuKhái niệm kho dữ liệuMô hình dữ liệu đa chiềuKiến trúc kho dữ liệuThi hành kho dữ liệuTừ xây dựng kho dữ liệu tới KPDLSự phát triển mới của công nghệ khối dữ liệu*Kho dữ liệu và khai phá dữ liệu: Chương 3*Iceberg CubeComputing only the cuboid cells whose count or other aggregates satisfying the condition:HAVING COUNT(*) >= minsupMotivationOnly a small portion of cube cells may be “above the water’’ in a sparse cubeOnly calculate “interesting” data—data above certain thresholdSuppose 100 dimensions, only 1 base cell. How many aggregate (non-base) cells if count >= 1? What about count >= 2?*Kho dữ liệu và khai phá dữ liệu: Chương 3*Bottom-Up Computation (BUC)BUC (Beyer & Ramakrishnan, SIGMOD’99) Bottom-up vs. top-down?—depending on how you view it!Apriori property:Aggregate the data, then move to the next levelIf minsup is not met, stop!If minsup = 1 Þ compute full CUBE!*Kho dữ liệu và khai phá dữ liệu: Chương 3*PartitioningUsually, entire data set can’t fit in main memorySort distinct values, partition into blocks that fitContinue processingOptimizationsPartitioningExternal Sorting, Hashing, Counting SortOrdering dimensions to encourage pruningCardinality, Skew, CorrelationCollapsing duplicatesCan’t do holistic aggregates anymore!*Kho dữ liệu và khai phá dữ liệu: Chương 3*Drawbacks of BUCRequires a significant amount of memoryOn par with most other CUBE algorithms thoughDoes not obtain good performance with dense CUBEsOverly skewed data or a bad choice of dimension ordering reduces performanceCannot compute iceberg cubes with complex measuresCREATE CUBE Sales_Iceberg ASSELECT month, city, cust_grp, AVG(price), COUNT(*)FROM Sales_InforCUBEBY month, city, cust_grpHAVING AVG(price) >= 800 AND COUNT(*) >= 50*Kho dữ liệu và khai phá dữ liệu: Chương 3*Non-Anti-Monotonic MeasuresThe cubing query with avg is non-anti-monotonic! (Mar, *, *, 600, 1800) fails the HAVING clause(Mar, *, Bus, 1300, 360) passes the clauseCREATE CUBE Sales_Iceberg ASSELECT month, city, cust_grp, AVG(price), COUNT(*)FROM Sales_InforCUBEBY month, city, cust_grpHAVING AVG(price) >= 800 AND COUNT(*) >= 50MonthCityCust_grpProdCostPriceJanTorEduPrinter500485JanTorHldTV8001200JanTorEduCamera11601280FebMonBusLaptop15002500MarVanEduHD540520*Kho dữ liệu và khai phá dữ liệu: Chương 3*Top-k AverageLet (*, Van, *) cover 1,000 recordsAvg(price) is the average price of those 1000 salesAvg50(price) is the average price of the top-50 sales (top-50 according to the sales priceTop-k average is anti-monotonicThe top 50 sales in Van. is with avg(price) = 800Large value collapsing: use a sum and a count to summarize records with measure >= 800If count>=800, no need to check “small” recordsSmall value binning: a group of binsOne bin covers a range, e.g., 600~800, 400~600, etc.Register a sum and a count for each bin*Kho dữ liệu và khai phá dữ liệu: Chương 3*Approximate top-k averageRangeSumCountOver 8002800020600~8001060015400~6001520030Top 50Approximate avg50()=(28000+10600+600*15)/50=952Suppose for (*, Van, *), we haveMonthCityCust_grpProdCostPriceThe cell may pass the HAVING clause*Kho dữ liệu và khai phá dữ liệu: Chương 3*Quant-info for Top-k Average BinningAccumulate quant-info for cells to compute average iceberg cubes efficientlyThree pieces: sum, count, top-k binsUse top-k bins to estimate/prune descendantsUse sum and count to consolidate current cellApproximate avg50()Anti-monotonic, can be computed efficientlyreal avg50()Anti-monotonic, but computationally costlyavg()Not anti-monotonicstrongestweakest*Kho dữ liệu và khai phá dữ liệu: Chương 3*An Efficient Iceberg Cubing Method: Top-k H-CubingOne can revise Apriori or BUC to compute a top-k avg iceberg cube. This leads to top-k-Apriori and top-k BUC.Can we compute iceberg cube more efficiently?Top-k H-cubing: an efficient method to compute iceberg cubes with average measureH-tree: a hyper-tree structureH-cubing: computing iceberg cubes using H-tree*Kho dữ liệu và khai phá dữ liệu: Chương 3*H-tree: A Prefix Hyper-treeMonthCityCust_grpProdCostPriceJanTorEduPrinter500485JanTorHhdTV8001200JanTorEduCamera11601280FebMonBusLaptop15002500MarVanEduHD540520rooteduhhdbusJanMarJanFebTorVanTorMonQ.I.Q.I.Q.I.Quant-InfoSum: 1765Cnt: 2binsAttr. Val.Quant-InfoSide-linkEduSum:2285 HhdBusJanFebTorVanMonHeadertable*Kho dữ liệu và khai phá dữ liệu: Chương 3*Properties of H-treeConstruction cost: a single database scanCompleteness: It contains the complete information needed for computing the iceberg cubeCompactness: # of nodes  n*m+1n: # of tuples in the tablem: # of attributes *Kho dữ liệu và khai phá dữ liệu: Chương 3*Computing Cells Involving Dimension CityrootEdu.Hhd.Bus.Jan.Mar.Jan.Feb.Tor.Van.Tor.Mon.Q.I.Q.I.Q.I.Quant-InfoSum: 1765Cnt: 2binsAttr. Val.Quant-InfoSide-linkEduSum:2285 HhdBusJanFebTorVanMonAttr. Val.Q.I.Side-linkEduHhdBusJanFebHeaderTableHTorFrom (*, *, Tor) to (*, Jan, Tor)*Kho dữ liệu và khai phá dữ liệu: Chương 3*Computing Cells Involving Month But No CityrootEdu.Hhd.Bus.Jan.Mar.Jan.Feb.Tor.Van.Tor.Mont.Q.I.Q.I.Q.I.Attr. Val.Quant-InfoSide-linkEdu.Sum:2285 Hhd.Bus.Jan.Feb.Mar.Tor.Van.Mont.Roll up quant-infoCompute cells involving month but no cityQ.I.Top-k OK mark: if Q.I. in a child passes top-k avg threshold, so does its parents. No binning is needed!*Kho dữ liệu và khai phá dữ liệu: Chương 3*Computing Cells Involving Only Cust_grprooteduhhdbusJanMarJanFebTorVanTorMonQ.I.Q.I.Q.I.Attr. Val.Quant-InfoSide-linkEduSum:2285 HhdBusJanFebMarTorVanMonCheck header table directlyQ.I.*Kho dữ liệu và khai phá dữ liệu: Chương 3*Properties of H-CubingSpace costan H-tree a stack of up to (m-1) header tablesOne database scanMain memory-based tree traversal & side-links updatesTop-k_OK marking*Kho dữ liệu và khai phá dữ liệu: Chương 3*Scalability w.r.t. Count Threshold (No min_avg Setting)*Kho dữ liệu và khai phá dữ liệu: Chương 3*Computing Iceberg Cubes with Other Complex MeasuresComputing other complex measuresKey point: find a function which is weaker but ensures certain anti-monotonicityExamplesAvg()  v: avgk(c)  v (bottom-k avg)Avg()  v only (no count): max(price)  v Sum(profit) (profit can be negative): p_sum(c)  v if p_count(c)  k; or otherwise, sumk(c)  v Others: conjunctions of multiple conditions*Kho dữ liệu và khai phá dữ liệu: Chương 3*Discussion: Other IssuesComputing iceberg cubes with more complex measures? No general answer for holistic measures, e.g., median, mode, rankA research theme even for complex algebraic functions, e.g., standard_dev, varianceDynamic vs . static computation of iceberg cubesv and k are only available at query timeSetting reasonably low parameters for most nontrivial casesMemory-hog? what if the cubing is too big to fit in memory?—projection and then cubing*Kho dữ liệu và khai phá dữ liệu: Chương 3*Condensed CubeW. Wang, H. Lu, J. Feng, J. X. Yu, Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE’02.Icerberg cube cannot solve all the problemsSuppose 100 dimensions, only 1 base cell with count = 10. How many aggregate (non-base) cells if count >= 10? Condensed cubeOnly need to store one cell (a1, a2, , a100, 10), which represents all the corresponding aggregate cellsAdv.Fully precomputed cube without compressionEfficient computation of the minimal condensed cube

Các file đính kèm theo tài liệu này:

dm_dw_k17_c4_3296.ppt