Memberships

Learn Microsoft Fabric • 12.7k members • Free

7 contributions to Learn Microsoft Fabric
Concurrency issue when orchestrating notebooks in a DAG
I'm using the mssparkutils.notebook.runMultiple function to orchestrate a number of notebooks from one master notebook. I've set the concurrency property to 6, which to my understanding simply allows up to 6 notebooks to run in parallel at any time, as long as their dependencies are all met. Each notebook individually runs just fine when executed outside of the DAG I've set up; however, when executed inside the DAG I keep getting a consistent error about files being added to a partition of the delta table each one updates, along the lines of "Files were added to partition by a concurrent update". Is there anything different about orchestrating notebooks in parallel in a DAG vs. not? To be clear, absolutely none of these notebooks update the same delta table. Each notebook results in an update to its own delta table.
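For context, here is a minimal sketch of the kind of DAG described above, following the runMultiple structure documented for Fabric notebookutils. The notebook names, paths and timeout values are illustrative placeholders, not taken from the original post:

    # Sketch only: orchestrate several notebooks from a master notebook via a DAG.
    # mssparkutils is pre-loaded in Fabric notebooks; the explicit import is optional.
    from notebookutils import mssparkutils

    dag = {
        "activities": [
            # These two have no dependencies, so they can start immediately
            {"name": "Load_Customers", "path": "Load_Customers", "dependencies": []},
            {"name": "Load_Orders", "path": "Load_Orders", "dependencies": []},
            # This one waits until both upstream notebooks have finished
            {"name": "Build_Sales", "path": "Build_Sales",
             "dependencies": ["Load_Customers", "Load_Orders"]},
        ],
        "concurrency": 6,          # up to 6 notebooks in flight once dependencies are met
        "timeoutInSeconds": 3600,  # overall timeout for the whole DAG
    }

    mssparkutils.notebook.runMultiple(dag)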
Hello
Hi, I am starting to evaluate Fabric as a potential upgrade to a reporting solution currently built in Synapse. I will have to do some rework anyway as Microsoft have dropped "Export to Datalake" in D365 F&O in favour of Synapse Link. I created a traditional data warehouse solution in Synapse, building dimension and fact tables in a medallion architecture. I also have Dev/Test/Prod environments, all parameterised so that I can move notebooks/pipelines using release pipelines in DevOps. All working reasonably well. In my evaluation of Fabric, I can get individual notebooks to run, but parameterising them seems to be a lot harder. I would see that medallion is using a couple of Lakehouses and then a Warehouse as the Gold layer. I would want to be able to transform from one Lakehouse to the next and then to the Warehouse, but connecting to each doesn't seem to be as easy as I might like. I know I am missing something. I have watched quite a few YouTube videos, but there isn't much coverage yet of that area. Does anyone know of any videos covering this kind of thing?
1 like • May '24
@Graham Cottle Wrapping, say, 10 notebooks inside one main notebook also helps with run times. I believe there isn't a high-concurrency style mode in Pipelines yet for Notebooks, which means that if you chain 20 notebooks together in series, each has a spin-up time for the Spark pool. Using %run inside one notebook that has 20 code cells calling the sub-notebooks uses a single spin-up and then it's good to go. I think it's on the Fabric release plan to have high concurrency for Notebooks in pipeline runs so you save this lost time.
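A rough sketch of that wrapper pattern, assuming the %run magic available in Fabric notebooks; the sub-notebook names are placeholders:

    # Wrapper notebook: each cell pulls in one sub-notebook via %run,
    # so the whole chain reuses a single Spark session rather than
    # paying a pool spin-up per notebook.

    # Cell 1
    %run Bronze_Ingest_Customers

    # Cell 2
    %run Bronze_Ingest_Orders

    # Cell 3
    %run Silver_Build_Sales

Because %run executes each sub-notebook in the wrapper's session, variables and temp views they define are also available to later cells.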
0 likes • May '24
@Graham Cottle Depending on how you manage your Fabric environment from a Spark compute perspective, another benefit of this approach is that you could group the notebooks by resource needs, chaining tasks that require higher node memory and more nodes separately from the lighter ones. Currently the chaining I have done is based on splitting tasks up in a fairly arbitrary way.
Whitelisting IP Addresses for Spark Notebooks
Hi everyone, is it currently possible to whitelist IP addresses to allow Spark Notebooks to access a REST API? If not, does anyone know if this feature is in development for Fabric pipelines? Thanks! Erki
1 like • May '24
@Erki Ohmann What platform are you attempting to whitelist the IP on? Spark notebooks DO have an IP address associated with them. I'm currently working with a client where someone in my business created a security rule in Azure so that an IP can be dynamically whitelisted when a notebook runs, letting it into an Azure-hosted VNet to access a SQL Server. The client is running below an F64 SKU, so a Private Endpoint isn't an option.
1 like • May '24
@Erki Ohmann @Will Needham We've deployed something, but admittedly it's not the most robust solution. The code below attempts to get the session's IP address from external sites and store it as a variable. We added the second try/except because the first site fails on occasion; it's possible both could fail, in which case the approach isn't usable. This is then paired with a network security rule in Azure that was created just for this SQL data load process. You may need to implement something similar for your REST API. Note that we do this on every run because IPs can change over time and we can't predict when.

    import json
    import urllib.request

    # Retrieve IP address of current session
    # Try first using v4.ident.me
    try:
        spark_session_ip = urllib.request.urlopen('http://v4.ident.me').read().decode('utf8')
    except:
        print("http://v4.ident.me website IP address retrieval failed. Attempting to use http://ip.jsontest.com/ instead")
        # Fall back to ip.jsontest.com
        try:
            with urllib.request.urlopen("http://ip.jsontest.com/") as response:
                data = json.loads(response.read().decode())
            spark_session_ip = data["ip"]
        except:
            print("http://ip.jsontest.com/ IP address load failed")
            exit()
Notebook to DWH not possible?
Hi everyone, does anyone know why it's (technically) not possible to use a notebook to store a table in a DWH? I have read this several times and don't understand the technical obstacle to doing so.
1 like • May '24
@Vinayak K Fantastic summary
Reduce Fabric CU usage and cut costs with a hybrid PBI Pro & F-SKU workspace model
The licensing for Microsoft Fabric can be confusing to newcomers, especially when you are introducing Fabric as your in-house ETL tool whilst migrating existing PBI reports to the platform. Below is some information that may be most relevant and useful for individuals running an "F" SKU below F64, or who do not use Fabric scaled to F64 all the time and also hold Power BI Pro licenses.

"Why is F64 relevant?" - When running workspaces licensed to Fabric on F64 or above, individuals can consume Power BI reports using a Fabric Free license, i.e. you do not need to purchase a Pro license for such users.

"But I don't have a need for F64. My user base is also far below the threshold to make it worth my while to scale up to F64 to save on the Pro license costs!" - Also fine. This post is for you specifically, and is intended to show how to keep your costs as low as possible as you grow by being smart about your workspace layout.

Imagine you're a smaller use-case business, with only 40 Power BI users holding Pro licenses to consume their PBI content. Maybe, in addition, the Fabric-specific content you are creating (Pipelines, Notebooks, Data Flows) requires only F4 / F8, far below the threshold for thinking about scaling up to F64. Maybe you also only use Fabric once per day to run ETL pipelines, at which point the only other use of the SKU is users consuming those PBI reports.

You can cut your Fabric Capacity Unit usage if you do the following:
1. Keep your PBI semantic models and reports in a separate workspace to your Lakehouse / Warehouse / Notebooks; essentially, split the PBI content from the Fabric content.
2. License the Fabric workspace to Fabric capacity, and for the PBI content workspace, use the Pro license method.

What does this do? Splitting your content in this manner means that PBI report consumption will not count towards usage in the Fabric workspace. This means you can pause the Fabric capacity entirely outside your ETL window, ultimately cutting costs. You're fully utilising your Pro licenses to host the PBI workspace, and don't have to run Fabric all day long.
Liam Shropshire-Fanning
@liam-shropshire-fanning-8829
Head of Architecture | Qualified Accountant | Data Engineer | Power 365 Solutions

Active 310d ago
Joined May 23, 2024