Data Mining: Concepts and Techniques
Slide 1: Data Mining: Concepts and Techniques
January 3, 2018
Slides for Textbook, Chapter 1
© Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science, Simon Fraser University, Canada
http://www.cs.sfu.ca
Slide 2: Acknowledgements
- Work on this set of slides started with my (Han’s) tutorial for a UCLA Extension course in February 1998.
- Dr. Hongjun Lu from Hong Kong Univ. of Science and Technology jointly taught a Data Mining Summer Course with me in Shanghai, China in July 1998. He has contributed many excellent slides to it.
- Some graduate students have contributed many new slides in the following years. Notable contributors include Eugene Belchev, Jian Pei, and Osmar R. Zaiane (now teaching at Univ. of Alberta).
Slide 3: CMPT-459-00.3 Course Schedule
- Chapter 1. Introduction {W1:L2, L3}
- Chapter 2. Data warehousing and OLAP technology for data mining {W2:L1-3, W3:L1-2}
  - Homework #1 distribution (SQLServer7.0 + DBMiner2.0)
- Chapter 3. Data preprocessing {W3:L3, W4:L1-L2}
- Chapter 4. Data mining primitives, languages and system architectures {W4:L3, W5:L1}
  - Homework #1 due, homework #2 distribution
- Chapter 5. Concept description: Characterization and comparison {W5:L2, L3, W6:L2}
  - W6:L1 Thanksgiving Day
- Chapter 6. Mining association rules in large databases {W6:L3, W7:L1-3, W8:L2}
- Midterm {W8:L2}
- Chapter 7. Classification and prediction {W8:L3, W9:L1-L3}
- Chapter 8. Clustering analysis {W10:L1-L3}
  - W10:L3 Homework #2 due
- Chapter 9. Mining complex types of data {W11:L2-L3, W12:L1-L3}
  - W11:L1 Remembrance Day, W12:L3 Course project due
- Chapter 10. Data mining applications and trends in data mining {W13:L1-L3}
- Final Exam (W14)
Slide 4: Where to Find the Set of Slides?
- Tutorial sections (MS PowerPoint files): http://www.cs.sfu.ca/~han/dmbook
- Other conference presentation slides (.ppt): http://db.cs.sfu.ca/ or http://www.cs.sfu.ca/~han
- Research papers, DBMiner system, and other related information: http://db.cs.sfu.ca/ or http://www.cs.sfu.ca/~han
Slide 5: Chapter 1. Introduction
- Motivation: Why data mining?
- What is data mining?
- Data Mining: On what kind of data?
- Data mining functionality
- Are all the patterns interesting?
- Classification of data mining systems
- Major issues in data mining
Slide 6: Motivation: “Necessity is the Mother of Invention”
- Data explosion problem
  - Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
- We are drowning in data, but starving for knowledge!
- Solution: Data warehousing and data mining
  - Data warehousing and on-line analytical processing
  - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Slide 7: Evolution of Database Technology (See Fig. 1.1)
- 1960s: Data collection, database creation, IMS and network DBMS
- 1970s: Relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases
Slide 8: What Is Data Mining?
- Data mining (knowledge discovery in databases):
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
- Alternative names and their “inside stories”:
  - Data mining: a misnomer?
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- What is not data mining?
  - (Deductive) query processing
  - Expert systems or small ML/statistical programs
Slide 9: Why Data Mining? Potential Applications
- Database analysis and decision support
  - Market analysis and management
    - Target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  - Risk analysis and management
    - Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other Applications
  - Text mining (news group, email, documents) and Web analysis
  - Intelligent query answering
Slide 10: Market Analysis and Management (1)
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association information
Slide 11: Market Analysis and Management (2)
- Customer profiling
  - Data mining can tell you what types of customers buy what products (clustering or classification)
- Identifying customer requirements
  - Identifying the best products for different customers
  - Use prediction to find what factors will attract new customers
- Provides summary information
  - Various multidimensional summary reports
  - Statistical summary information (data central tendency and variation)
Slide 12: Corporate Analysis and Risk Management
- Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Contingent claim analysis to evaluate assets
  - Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning:
  - Summarize and compare the resources and spending
- Competition:
  - Monitor competitors and market directions
  - Group customers into classes and a class-based pricing procedure
  - Set pricing strategy in a highly competitive market
Slide 13: Fraud Detection and Management (1)
- Applications
  - Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach
  - Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
- Examples
  - Auto insurance: detect a group of people who stage accidents to collect on insurance
  - Money laundering: detect suspicious money transactions (US Treasury’s Financial Crimes Enforcement Network)
  - Medical insurance: detect professional patients, rings of doctors, and rings of references
Slide 14: Fraud Detection and Management (2)
- Detecting inappropriate medical treatment
  - The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
- Detecting telephone fraud
  - Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  - British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
- Retail
  - Analysts estimate that 38% of retail shrink is due to dishonest employees.
Slide 15: Other Applications
- Sports
  - IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
- Astronomy
  - JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
- Internet Web Surf-Aid
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
Slide 16: Data Mining: A KDD Process
- Data mining: the core of the knowledge discovery process.
[Figure: KDD process diagram. Labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation, Knowledge]
Slide 17: Steps of a KDD Process
- Learning the application domain:
  - Relevant prior knowledge and goals of application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of effort!)
- Data reduction and transformation:
  - Find useful features, dimensionality/variable reduction, invariant representation
- Choosing functions of data mining
  - Summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
  - Visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
Slide 18: Data Mining and Business Intelligence
[Figure: layered diagram with an arrow labeled "Increasing potential to support business decisions". Layers, top to bottom: Making Decisions; Data Presentation (Visualization Techniques); Data Mining (Information Discovery); Data Exploration (OLAP, MDA, Statistical Analysis, Querying and Reporting); Data Warehouses / Data Marts; Data Sources (Paper, Files, Information Providers, Database Systems, OLTP). Roles shown alongside: End User, Business Analyst, Data Analyst, DBA]
Slide 19: Architecture of a Typical Data Mining System
[Figure: architecture diagram. Labels: Graphical user interface, Pattern evaluation, Data mining engine, Knowledge-base, Database or data warehouse server, Data cleaning & data integration, Filtering, Databases, Data Warehouse]
Slide 20: Data Mining: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - WWW
Slide 21: Data Mining Functionalities (1)
- Concept description: Characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Association (correlation and causality)
  - Multi-dimensional vs. single-dimensional association
  - age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]
  - contains(T, “computer”) → contains(T, “software”) [support = 1%, confidence = 75%]
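
The support and confidence figures attached to the association rules on slide 21 can be computed directly from a set of transactions. The following is a minimal sketch, assuming the usual definitions: support is the fraction of transactions containing all items of the rule, and confidence is the fraction of transactions containing the antecedent that also contain the consequent. The function names and the toy transactions are hypothetical and are not taken from the slides.

```python
from typing import List, Set


def support(transactions: List[Set[str]], itemset: Set[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[Set[str]], antecedent: Set[str], consequent: Set[str]
) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0.0:
        return 0.0
    return support(transactions, antecedent | consequent) / base


# Hypothetical toy data: each set is one market-basket transaction.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(f"computer -> software [support = {rule_support:.0%}, "
      f"confidence = {rule_confidence:.0%}]")
```

In this toy data, two of the four transactions contain both items, and two of the three transactions containing "computer" also contain "software".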
Slide 22: Data Mining Functionalities (2)
- Classification and Prediction
  - Finding models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Presentation: decision-tree, classification rule, neural network
  - Prediction: Predict some unknown or missing numerical values
- Cluster analysis
  - Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
  - Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
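
The clustering principle on slide 22 can be made concrete by scoring a candidate grouping. This is a minimal sketch, assuming one-dimensional data and using absolute distance as the inverse of similarity, so a good grouping has a small average distance inside clusters and a large average distance between clusters. The helper names and sample values are hypothetical.

```python
from itertools import combinations, product
from statistics import mean
from typing import List


def mean_intra_distance(clusters: List[List[float]]) -> float:
    """Average distance between pairs of points in the same cluster."""
    pairs = [abs(a - b) for c in clusters for a, b in combinations(c, 2)]
    return mean(pairs) if pairs else 0.0


def mean_inter_distance(clusters: List[List[float]]) -> float:
    """Average distance between pairs of points in different clusters."""
    pairs = [
        abs(a - b)
        for c1, c2 in combinations(clusters, 2)
        for a, b in product(c1, c2)
    ]
    return mean(pairs) if pairs else 0.0


# Hypothetical values, e.g. a single numeric attribute of six houses.
good = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]  # groups nearby values
bad = [[1.0, 10.0, 3.0], [2.0, 11.0, 12.0]]   # mixes near and far values

for name, grouping in (("good", good), ("bad", bad)):
    print(name, mean_intra_distance(grouping), mean_inter_distance(grouping))
```

The grouping labeled `good` has the lower intra-cluster distance and the higher inter-cluster distance, which is what the principle asks for.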
Slide 23: Data Mining Functionalities (3)
- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis
- Trend and evolution analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses
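
One simple way to flag objects that do not comply with the general behavior of the data is a z-score test. The sketch below is only one of many possible outlier tests and is not prescribed by the slides; the function name, the threshold of 2.0, and the sample amounts are assumptions chosen for illustration.

```python
from statistics import mean, pstdev
from typing import List


def zscore_outliers(values: List[float], threshold: float = 2.0) -> List[float]:
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts with one unusually large value.
amounts = [20, 22, 19, 21, 23, 20, 500]
print(zscore_outliers(amounts))
```

Because a single extreme value also inflates the mean and standard deviation, a z-score test on small samples is conservative; that is why the threshold here is set lower than the commonly quoted value of 3.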
Slide 24: Are All the “Discovered” Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
  - Suggested approach: Human-centered, query-based, focused mining
- Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures:
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
Slide 25: Can We Find All and Only Interesting Patterns?
- Find all the interesting patterns: Completeness
  - Can a data mining system find all the interesting patterns?
  - Association vs. classification vs. clustering
- Search for only interesting patterns: Optimization
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones.
    - Generate only the interesting patterns: mining query optimization
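
The first approach on slide 25, generating all patterns and then filtering out the uninteresting ones, can be sketched with objective measures as the filter. This is a minimal illustration, assuming the patterns are association rules already annotated with support and confidence. The `Rule` type, the thresholds, and the third rule are hypothetical; the first two rules reuse the figures quoted on slide 21.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    antecedent: str
    consequent: str
    support: float
    confidence: float


def filter_interesting(
    rules: List[Rule], min_support: float, min_confidence: float
) -> List[Rule]:
    """Keep only rules that meet both objective thresholds."""
    return [
        r for r in rules
        if r.support >= min_support and r.confidence >= min_confidence
    ]


# Step 1 (assumed already done): every candidate pattern has been generated.
all_rules = [
    Rule("age 20..29 ^ income 20..29K", "buys PC", 0.02, 0.60),
    Rule("contains computer", "contains software", 0.01, 0.75),
    Rule("contains printer", "contains paper", 0.001, 0.30),  # hypothetical
]

# Step 2: filter out the uninteresting ones.
for rule in filter_interesting(all_rules, min_support=0.01, min_confidence=0.5):
    print(rule)
```

The second approach on the slide would instead push such thresholds into the generation step so that low-support candidates are never produced.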
Slide 26: Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, connected to Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
Slide 27: Data Mining: Classification Schemes
- General functionality
  - Descriptive data mining
  - Predictive data mining
- Different views, different classifications
  - Kinds of databases to be mined
  - Kinds of knowledge to be discovered
  - Kinds of techniques utilized
  - Kinds of applications adapted
Slide 28: A Multi-Dimensional View of Data Mining Classification
- Databases to be mined
  - Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
- Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
Slide 29: OLAP Mining: An Integration of Data Mining and Data Warehousing
- Coupling of data mining systems with DBMS and data warehouse systems
  - No coupling, loose-coupling, semi-tight-coupling, tight-coupling
- On-line analytical mining of data
  - Integration of mining and OLAP technologies
- Interactive mining of multi-level knowledge
  - Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
- Integration of multiple mining functions
  - Characterized classification, first clustering and then association
Slide 30: An OLAM Architecture
[Figure: four-layer architecture diagram. Layers: Layer 1 Data Repository, Layer 2 MDDB, Layer 3 OLAP/OLAM, Layer 4 User Interface. Component labels: Databases, Data Warehouse, Meta Data, MDDB, OLAM Engine, OLAP Engine, User GUI API, Data Cube API, Database API, Data cleaning, Data integration, Filtering & Integration, Filtering, Mining query, Mining result]
Slide 31: Major Issues in Data Mining (1)
- Mining methodology and user interaction
  - Mining different kinds of knowledge in databases
  - Interactive mining of knowledge at multiple levels of abstraction
  - Incorporation of background knowledge
  - Data mining query languages and ad-hoc data mining
  - Expression and visualization of data mining results
  - Handling noise and incomplete data
  - Pattern evaluation: the interestingness problem
- Performance and scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed and incremental mining methods
Slide 32: Major Issues in Data Mining (2)
- Issues relating to the diversity of data types
  - Handling relational and complex types of data
  - Mining information from heterogeneous databases and global information systems (WWW)
- Issues related to applications and social impacts
  - Application of discovered knowledge
    - Domain-specific data mining tools
    - Intelligent query answering
    - Process control and decision making
  - Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem
  - Protection of data security, integrity, and privacy
Slide 33: Summary
- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Classification of data mining systems
- Major issues in data mining
Slide 34: A Brief History of Data Mining Society
- 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
- 1991-1994 Workshops on Knowledge Discovery in Databases
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
- 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
- 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
- More conferences on data mining
  - PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, etc.
Slide 35: Where to Find References?
- Data mining and KDD (SIGKDD member CD-ROM):
  - Conference proceedings: KDD, and others, such as PKDD, PAKDD, etc.
  - Journal: Data Mining and Knowledge Discovery
- Database field (SIGMOD member CD-ROM):
  - Conference proceedings: ACM-SIGMOD, ACM-PODS, VLDB, ICDE, EDBT, DASFAA
  - Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc.
- AI and Machine Learning:
  - Conference proceedings: Machine learning, AAAI, IJCAI, etc.
  - Journals: Machine Learning, Artificial Intelligence, etc.
- Statistics:
  - Conference proceedings: Joint Stat. Meeting, etc.
  - Journals: Annals of Statistics, etc.
- Visualization:
  - Conference proceedings: CHI, etc.
  - Journals: IEEE Trans. Visualization and Computer Graphics, etc.
Slide 36: References
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
- T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, 1996.
- G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U. M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
- G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
Slide 37: Thank you!
http://www.cs.sfu.ca/~han
Slide 38: CMPT-843 Course Arrangement
- 1st week: full instructor teaching
- 2nd to 11th week: 1/2 graduate student + 1/2 instructor teaching
- 12th-13th week: full graduate student project presentation
- Course evaluation:
  - Presentation (quality of presentation slides 7% + presentation 8%): 15%
  - Midterm exam: 35%
  - Project (presentation 5% + report 25%): 30%
  - Homework (2): 20%
- Deadlines for the selection of your work in the semester:
  - Selection of course presentation: at the end of the 1st week
  - Selection of the course project: at the end of the 3rd week
  - Project proposal due date: at the end of the 4th week
  - Homework due dates:
  - Project due date: end of the semester
  - Your presentation slides due date: one day before the presentation
  - Midterm date: end of the 8th week