As a skilled data scientist at Microsoft, you play a critical role in protecting millions of Windows users from the ever-evolving threat of malware. The malware landscape is constantly changing, with new viruses and attack vectors emerging every day. To stay ahead of these threats, Microsoft needs to leverage advanced machine learning techniques to identify and neutralize malware before it can cause harm.
You are given a massive and complex dataset containing telemetry data from Windows machines worldwide. Your task is to build a robust and accurate classification model that predicts the likelihood of a machine being infected with malware based on its configuration, usage patterns, and other relevant factors. This model will be integrated into Microsoft's endpoint protection solution, Windows Defender, to proactively identify and block malware attacks, safeguarding the security and privacy of millions of users around the globe.
Goal: The goal of this project is to build a model that can predict whether a Windows machine will be infected with malware.
The dataset contains telemetry data and machine properties, used to predict the probability of a machine getting infected by malware.
MachineIdentifier: Individual machine ID
ProductName: Defender state information
EngineVersion: Defender state information
AppVersion: Defender state information
AvSigVersion: Defender state information
IsBeta: Defender state information
RtpStateBitfield: NA
IsSxsPassiveMode: NA
DefaultBrowsersIdentifier: ID for the machine's default browser
AVProductStatesIdentifier: ID for antivirus software configuration
AVProductsInstalled: NA
AVProductsEnabled: NA
HasTpm: True if machine has tpm
CountryIdentifier: ID for the country the machine is located in
CityIdentifier: ID for the city the machine is located in
OrganizationIdentifier: ID for the organization the machine belongs to
GeoNameIdentifier: ID for the geographic region a machine is located in
LocaleEnglishNameIdentifier: English name of Locale ID
Platform: Platform name
Processor: Processor architecture
OsVer: Version of the current operating system
OsBuild: Build of the current operating system
OsSuite: Product suite mask
OsPlatformSubRelease: OS Platform sub-release
OsBuildLab: Build lab that generated the current OS
SkuEdition: SKU-Edition name
IsProtected: Whether a machine is protected
AutoSampleOptIn: SubmitSamplesConsent value
PuaMode: Pua Enabled mode
SMode: Field for S mode
IeVerIdentifier: NA
SmartScreen: SmartScreen enabled string value
Firewall: Windows firewall is enabled
UacLuaenable: Attribute that reports whether or not the "administrator in Admin Approval Mode" user type is disabled or enabled in UAC
Census_MDC2FormFactor: Device census level hardware characteristics
Census_DeviceFamily: Device type
Census_OEMNameIdentifier: NA
Census_OEMModelIdentifier: NA
Census_ProcessorCoreCount: Number of logical cores in the processor
Census_ProcessorManufacturerIdentifier: NA
Census_ProcessorModelIdentifier: NA
Census_ProcessorClass: Processor classification
Census_PrimaryDiskTotalCapacity: Amount of disk space on primary disk
Census_PrimaryDiskTypeName: Primary Disk Type
Census_SystemVolumeTotalCapacity: Size of the system volume partition
Census_HasOpticalDiskDrive: True if machine has an optical disk drive
Census_TotalPhysicalRAM: Physical RAM
Census_ChassisTypeName: Type of chassis
Census_InternalPrimaryDiagonalDisplaySizeInInches: Physical diagonal length in inches of the primary display
Census_InternalPrimaryDisplayResolutionHorizontal: Pixel resolution in the horizontal direction
Census_InternalPrimaryDisplayResolutionVertical: Pixel resolution in the vertical direction
Census_PowerPlatformRoleName: Power management profile
Census_InternalBatteryType: NA
Census_InternalBatteryNumberOfCharges: NA
Census_OSVersion: Numeric OS version
Census_OSArchitecture: Architecture on which the OS is based
Census_OSBranch: Branch of the OS
Census_OSBuildNumber: OS Build number
Census_OSBuildRevision: OS Build revision
Census_OSEdition: Edition of the current OS
Census_OSSkuName: OS edition friendly name
Census_OSInstallTypeName: Description of the install
Census_OSInstallLanguageIdentifier: NA
Census_OSUILocaleIdentifier: NA
Census_OSWUAutoUpdateOptionsName: Windows Update auto-update settings
Census_IsPortableOperatingSystem: True if OS is booted from USB
Census_GenuineStateName: OSGenuineStateID
Census_ActivationChannel: License key
Census_IsFlightingInternal: NA
Census_IsFlightsDisabled: If machine is participating in flighting
Census_FlightRing: Ring the device user receives flights for
Census_ThresholdOptIn: NA
Census_FirmwareManufacturerIdentifier: NA
Census_FirmwareVersionIdentifier: NA
Census_IsSecureBootEnabled: Secure Boot mode is enabled
Census_IsWIMBootEnabled: NA
Census_IsVirtualDevice: Identifies a Virtual Machine
Census_IsTouchEnabled: Is this a touch device?
Census_IsPenCapable: Is the device capable of pen input?
Census_IsAlwaysOnAlwaysConnectedCapable: battery status
Wdft_IsGamer: Is this a gamer device
Wdft_RegionIdentifier: NA
HasDetections: the target variable
Data Source: Kaggle Microsoft Malware Prediction
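Because the dataset is described as massive, it may help to load only the columns you need and give them compact dtypes. The sketch below is one possible way to do that with pandas. The column names come from the data dictionary above, but the dtype choices, the helper name load_telemetry, and the three-row sample are assumptions made purely for illustration; the real file would be passed in place of the in-memory sample.

```python
import io

import pandas as pd

# Hypothetical subset of columns. Names are taken from the data dictionary,
# but the dtype choices are assumptions made for this sketch.
DTYPES = {
    "MachineIdentifier": "string",
    "ProductName": "category",
    "IsBeta": "int8",
    "HasTpm": "int8",
    "Census_ProcessorCoreCount": "float32",  # float so missing values survive
    "HasDetections": "int8",
}


def load_telemetry(source, dtypes=DTYPES):
    """Read only the listed columns, using compact dtypes to limit memory."""
    return pd.read_csv(source, usecols=list(dtypes), dtype=dtypes)


# Tiny made-up sample standing in for the real file (illustration only).
SAMPLE_CSV = io.StringIO(
    "MachineIdentifier,ProductName,IsBeta,HasTpm,"
    "Census_ProcessorCoreCount,HasDetections\n"
    "a1,win8defender,0,1,4,0\n"
    "b2,win8defender,0,1,,1\n"
    "c3,mse,0,0,2,1\n"
)

df = load_telemetry(SAMPLE_CSV)
print(df.dtypes)
print("share of rows with HasDetections == 1:", df["HasDetections"].mean())
```

Columns marked NA in the dictionary have no documented meaning, so their dtypes would need to be decided after inspecting the actual values.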
Your task is to build a classification model to predict malware infection.
Python Libraries: Pandas, NumPy, scikit-learn, Matplotlib/Seaborn, (potentially) XGBoost/LightGBM due to the size and complexity of the data.