Understanding Data Preparation

Last updated 6 months ago

Data preparation is a crucial step in any data science or machine learning project. The quality of your data directly impacts the performance of your models and the reliability of your insights. This guide explores key concepts and techniques in data preparation, from handling missing values to addressing imbalanced datasets.

Missing Data Analysis and Treatment

Missing data is a common challenge in real-world datasets. Understanding the mechanism behind missing data is crucial for choosing the appropriate treatment method.

Types of Missing Data

  1. Missing Completely At Random (MCAR): When data is MCAR, the missingness has no relationship with any variable in the dataset, whether observed or missing. For example, if a sensor randomly malfunctions, the missing readings would be MCAR. In this case, simple imputation methods are appropriate:

    • Mean imputation for numerical data

    • Median imputation for skewed distributions

    • KNN imputation for maintaining relationships between variables

  2. Missing At Random (MAR): MAR occurs when the missingness can be explained by other observed variables in the dataset. For instance, if younger people are less likely to report their income and age is recorded, the missing income data is MAR. Multiple Imputation by Chained Equations (MICE) is particularly effective for MAR data as it:

    • Creates multiple complete datasets

    • Accounts for uncertainty in imputations

    • Preserves relationships between variables

  3. Missing Not At Random (MNAR): MNAR occurs when the missingness depends on the missing values themselves. For example, people with high incomes may be less likely to report their income, so the gap depends on the very value that is missing. MNAR requires specialized approaches:

    • Selection models that explicitly model the missing data mechanism

    • Pattern-mixture models that stratify data based on missing patterns

    • Shared parameter models that link the measurement and missing data processes
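
The MAR case can be illustrated with a small simulation. The sketch below is not from the original guide: it uses synthetic, hypothetical age and income data, and it uses scikit-learn's `IterativeImputer` (a chained-equations imputer inspired by MICE, still marked experimental) with `sample_posterior=True` and several seeds as a rough stand-in for a full multiple-imputation workflow.

```python
import numpy as np
# IterativeImputer is experimental, so this enabling import must come first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 2000

# Synthetic data (hypothetical): income rises with age, plus noise
age = rng.uniform(20, 65, n)
income = 20_000 + 1_000 * age + rng.normal(0, 5_000, n)
true_mean = income.mean()

# MAR mechanism: missingness depends only on an observed variable (age)
p_missing = np.where(age < 35, 0.8, 0.1)
missing = rng.random(n) < p_missing
income_obs = income.copy()
income_obs[missing] = np.nan

# Complete-case estimate: simply ignore rows with missing income
complete_case_mean = np.nanmean(income_obs)

# Multiple imputation: build m completed datasets, each drawn from the
# posterior predictive distribution, then pool the estimates
X = np.column_stack([age, income_obs])
m = 5
imputed_means = []
for seed in range(m):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed = imputer.fit_transform(X)
    imputed_means.append(completed[:, 1].mean())

pooled_mean = float(np.mean(imputed_means))         # pooled point estimate
between_var = float(np.var(imputed_means, ddof=1))  # spread across imputations

print(f"true mean:          {true_mean:,.0f}")
print(f"complete-case mean: {complete_case_mean:,.0f}")
print(f"pooled imputed mean: {pooled_mean:,.0f}")
```

Because younger (lower-income) respondents are missing more often, the complete-case mean overstates income, while the pooled estimate uses the age-income relationship to correct for it. The variance between imputations is one ingredient of the uncertainty that multiple imputation is meant to capture.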

Text Data Processing

Text data often contains noise that can hinder model performance. Stop words are a prime example of such noise.

Stop Words Removal

Stop words are common words like "the," "is," and "at" that carry little meaningful information. Removing them:

  • Reduces the dimensionality of the text data

  • Improves processing speed

  • Helps focus on meaningful content

Both NLTK and spaCy libraries offer comprehensive stop word lists, but you can also customize these lists based on your specific needs. For instance, in a technical document analysis, words like "figure" or "table" might be considered stop words.
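
As a minimal sketch of the customization idea, the example below measures vocabulary size before and after filtering. The tiny `BASE_STOP_WORDS` set is a hypothetical stand-in for the much larger NLTK or spaCy lists, and the sample sentences and helper names are made up for illustration.

```python
# Hypothetical mini stop list; in practice this would come from NLTK or spaCy
BASE_STOP_WORDS = {"the", "is", "at", "a", "in", "of", "and", "to", "as"}
# Domain-specific additions for technical documents
DOMAIN_STOP_WORDS = {"figure", "table"}


def remove_stop_words(text, stop_words):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    tokens = [t.strip('.,;:()"').lower() for t in text.split()]
    return [t for t in tokens if t and t not in stop_words]


def vocabulary(docs, stop_words):
    """Set of distinct tokens left after stop word removal."""
    return {tok for doc in docs for tok in remove_stop_words(doc, stop_words)}


docs = [
    "The latency is shown in Figure 2 and Table 1.",
    "Throughput at peak load is reported in Table 3.",
]

vocab_raw = vocabulary(docs, set())
vocab_base = vocabulary(docs, BASE_STOP_WORDS)
vocab_custom = vocabulary(docs, BASE_STOP_WORDS | DOMAIN_STOP_WORDS)

print(len(vocab_raw), len(vocab_base), len(vocab_custom))
```

Each filtering step shrinks the vocabulary, which is the dimensionality reduction described above; the domain list removes words that are frequent in this corpus but carry little signal.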

Data Formatting and Normalization

Raw data often contains irregularities and inconsistencies that need addressing.

Common Data Format Issues

Several factors can lead to data corruption:

  • User input errors (e.g., incorrect date formats)

  • System synchronization issues

  • Processing errors during data transfer

NumPy and Pandas provide robust tools for handling these issues. For complex cases, custom Python functions can be developed to address specific formatting requirements.
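
One way such a custom function might look for inconsistent date formats is sketched below. The helper name `parse_dates`, the sample values, and the list of accepted formats are assumptions chosen for illustration; a real pipeline would derive the format list from the data sources involved.

```python
import pandas as pd


def parse_dates(series, formats):
    """Try each explicit format in turn; values matching none remain NaT."""
    parsed = pd.to_datetime(series, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        attempt = pd.to_datetime(series, format=fmt, errors="coerce")
        # Only fill positions that earlier formats could not parse
        parsed = parsed.fillna(attempt)
    return parsed


# Hypothetical user-entered dates in three different conventions
raw = pd.Series(["2024-03-15", "15/03/2024", "March 15, 2024", "not a date"])
formats = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

clean = parse_dates(raw, formats)
print(clean)
```

Listing the formats explicitly makes the assumptions visible: an ambiguous value such as 03/04/2024 is read as day/month here only because `%d/%m/%Y` is the format supplied. Unparseable entries surface as `NaT`, so they can be counted and reviewed rather than silently guessed.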

Data Scaling

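Different features often have different scales, which can bias machine learning models that rely on distances or gradient-based optimization toward the features with the largest magnitudes. Common scaling techniques include: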

  • StandardScaler: Transforms data to have zero mean and unit variance

  • MinMaxScaler: Scales data to a fixed range, usually [0,1]

  • RobustScaler: Scales data using statistics that are robust to outliers
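
To make the difference between the three scalers concrete, here is a small comparison on a single hypothetical feature that contains one extreme value. It only uses the scikit-learn classes named above with their default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with a single extreme value (hypothetical data)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)  # (x - mean) / std
minmax = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min)
robust = RobustScaler().fit_transform(X)      # (x - median) / IQR

for name, scaled in [("standard", standard), ("minmax", minmax), ("robust", robust)]:
    print(f"{name:>8}: {np.round(scaled.ravel(), 3)}")
```

With MinMaxScaler the four typical values are squeezed into a narrow band near 0 because the extreme value defines the range, and StandardScaler is affected in a similar way through the mean and standard deviation. RobustScaler centers on the median and divides by the interquartile range, so the typical values stay well spread and the extreme value remains clearly separated.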

Outlier Detection and Treatment

Outliers are data points that significantly deviate from the general pattern of the dataset.

Detection Methods

  1. Z-score Method: This method assumes an approximately normal distribution and flags points that are several standard deviations away from the mean:

    Z-score = (x - mean) / standard deviation

    Points with |Z-score| > 3 are typically considered outliers.

  2. Interquartile Range (IQR) Method: This method is more robust to non-normal distributions:

    IQR = Q3 - Q1
    Lower bound = Q1 - 1.5 * IQR
    Upper bound = Q3 + 1.5 * IQR
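
A short worked example, using made-up numbers, shows how the two rules can disagree. In a small sample, a single extreme value inflates both the mean and the standard deviation, which can keep its own Z-score below the threshold of 3.

```python
import numpy as np

# Hypothetical sample: seven typical readings and one extreme value
data = np.array([10, 12, 11, 13, 12, 11, 10, 95], dtype=float)

# Z-score rule (population standard deviation)
z_scores = (data - data.mean()) / data.std()
z_outliers = np.abs(z_scores) > 3

# IQR rule
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = (data < lower) | (data > upper)

print("z-score of 95:", round(z_scores[-1], 2))
print("flagged by z-score:", data[z_outliers])
print("IQR bounds:", lower, upper)
print("flagged by IQR:", data[iqr_outliers])
```

Here the value 95 has a Z-score of about 2.64, so the Z-score rule flags nothing, while the IQR bounds of 8.5 and 14.5 isolate it. Quartiles barely move when one value is extreme, which is the sense in which the IQR method is more robust.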

Outlier Treatment

Not all outliers are errors - some may represent genuine rare events. Before treating outliers:

  • Consult domain experts

  • Understand the business context

  • Consider the impact on your analysis

Handling Imbalanced Datasets

A dataset is imbalanced when one class significantly outnumbers the others, which can bias models toward the majority class.

Addressing Imbalance

  1. Sampling Techniques

    • Undersampling: Reducing majority class samples

    • Oversampling: Increasing minority class samples

    Both approaches have drawbacks: undersampling may lose important information, while oversampling may lead to overfitting.

  2. SMOTE (Synthetic Minority Over-sampling Technique) SMOTE generates synthetic samples for the minority class by:

    • Identifying k-nearest neighbors for minority class samples

    • Creating new samples along the lines connecting these points

    Because the new samples are interpolated rather than exact copies, this approach reduces the overfitting risk of plain oversampling while balancing the dataset.
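
The interpolation step can be sketched in plain NumPy. This is a simplified illustration of the idea, not the imbalanced-learn implementation; the function name `smote_sketch`, its parameters, and the sample points are hypothetical, and it assumes `k` is smaller than the number of minority samples.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))           # random minority sample
        nb = neighbors[base, rng.integers(k)]     # one of its k nearest neighbors
        gap = rng.random()                        # position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic


# Hypothetical minority class with four 2-D samples
X_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [2.5, 2.2]])
X_new = smote_sketch(X_minority, n_new=6)
print(X_new)
```

Every synthetic point lies on a segment between two existing minority samples, so the new data stays inside the region the minority class already occupies. For real projects, the `SMOTE` class from imbalanced-learn used in the code examples at the end of this page is the more complete option.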

Data Labeling

Data labeling is crucial for supervised learning tasks but can be resource-intensive.

AWS Solutions

  1. Amazon SageMaker Ground Truth

    • Self-service model

    • Flexible labeling workflows

    • Integration with AWS ecosystem

  2. Amazon SageMaker Ground Truth Plus

    • Managed service model

    • Up to 40% cost reduction compared to self-service, according to AWS

    • Professional labeling workforce

Conclusion

Effective data preparation is fundamental to successful data science projects. While it may be time-consuming, proper data preparation:

  • Improves model performance

  • Reduces training time

  • Leads to more reliable insights

  • Prevents garbage-in-garbage-out scenarios

Remember that data preparation is not a one-size-fits-all process. Each dataset and project may require different combinations of these techniques, and the choice of methods should be guided by both statistical principles and domain knowledge.

Code Examples

import numpy as np
import pandas as pd
import missingno as msno
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import StandardScaler
import nltk
from nltk.corpus import stopwords
from imblearn.over_sampling import SMOTE

# 1. Missing Data Analysis and Handling
def analyze_missing_data(df):
    """
    Visualize and analyze missing data patterns
    """
    # Visualize missing data patterns
    msno.matrix(df)
    msno.heatmap(df)
    
    # Calculate missing percentages
    missing_percentages = df.isnull().sum() / len(df) * 100
    return missing_percentages

def handle_missing_data(df, numerical_columns, categorical_columns, numeric_strategy='mean'):
    """
    Impute missing values and return a new DataFrame.

    numeric_strategy: 'mean' or 'median' for simple imputation, or 'knn'
    to use relationships between the numerical columns.
    """
    df = df.copy()

    # Choose ONE numerical imputer: running mean imputation first would
    # leave no missing values for the KNN imputer to fill
    if numeric_strategy == 'knn':
        num_imputer = KNNImputer(n_neighbors=5)
    else:
        num_imputer = SimpleImputer(strategy=numeric_strategy)
    df[numerical_columns] = num_imputer.fit_transform(df[numerical_columns])

    # Mode imputation for categorical data; pandas fillna treats both
    # None and NaN as missing
    for col in categorical_columns:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

    return df

# 2. Text Processing and Stop Words
def process_text(text, custom_stop_words=None):
    """
    Process text data and remove stop words
    """
    # Download the stop word list if needed (quick no-op when already present)
    nltk.download('stopwords', quiet=True)
    
    # Get default stop words and add custom ones
    stop_words = set(stopwords.words('english'))
    if custom_stop_words:
        stop_words.update(custom_stop_words)
    
    # Remove stop words
    words = text.lower().split()
    filtered_words = [word for word in words if word not in stop_words]
    
    return ' '.join(filtered_words)

# 3. Outlier Detection and Handling
def detect_outliers(data):
    """
    Detect outliers using Z-score and IQR methods
    """
    # Z-score method
    z_scores = (data - np.mean(data)) / np.std(data)
    z_score_outliers = np.abs(z_scores) > 3
    
    # IQR method
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    IQR_outliers = (data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))
    
    return z_score_outliers, IQR_outliers

# 4. Handle Imbalanced Data
def balance_dataset(X, y):
    """
    Balance dataset using SMOTE
    """
    smote = SMOTE(random_state=42)
    X_balanced, y_balanced = smote.fit_resample(X, y)
    
    return X_balanced, y_balanced

# 5. Data Formatting and Scaling
def format_and_scale_data(df, numerical_columns):
    """
    Format and scale numerical data
    """
    # Handle data type conversions
    for col in numerical_columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    
    # Scale numerical features
    scaler = StandardScaler()
    df[numerical_columns] = scaler.fit_transform(df[numerical_columns])
    
    return df

# Example usage
if __name__ == "__main__":
    # Load sample data
    df = pd.DataFrame({
        'num_feature': [1, 2, np.nan, 4, 100],
        'cat_feature': ['A', None, 'B', 'A', 'C'],
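        # All columns must have the same length (5 rows)
        'text_feature': ['This is a sample text',
                         'Another example with stop words',
                         'More text data',
                         'Yet another short sentence',
                         'The final text entry']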
    })
    
    # Analyze missing data
    missing_analysis = analyze_missing_data(df)
    
    # Handle missing data
    df = handle_missing_data(df, 
                           numerical_columns=['num_feature'],
                           categorical_columns=['cat_feature'])
    
    # Process text
    df['text_feature'] = df['text_feature'].apply(process_text)
    
    # Detect outliers
    z_outliers, iqr_outliers = detect_outliers(df['num_feature'])
    
    # Format and scale data
    df = format_and_scale_data(df, numerical_columns=['num_feature'])